DOCUMENT RESUME

ED 377 819                                              IR 016 915

AUTHOR        Frazier, Michael Duane
TITLE         Matters Horn and Other Features in the Computational
              Learning Theory Landscape: The Notion of Membership.
INSTITUTION   Illinois Univ., Urbana. Dept. of Computer Science.
REPORT NO     UILU-ENG-94-1716; UIUCDCS-R-94-1858
PUB DATE      Apr 94
NOTE          188p.; Ph.D. Dissertation, University of Illinois at
              Urbana-Champaign.
PUB TYPE      Dissertations/Theses - Doctoral Dissertations (041)
              -- Reports - Evaluative/Feasibility (142)
EDRS PRICE    MF01/PC08 Plus Postage.
DESCRIPTORS   Algorithms; *Automation; *Coding; Computation; Data
              Collection; *Group Membership; *Learning Theories;
              Problem Solving
IDENTIFIERS   *Computational Learning Theory; *Horn Sentences;
              Knowledge Acquisition; Representation Language;
              Uncertainty



ABSTRACT 

Computer task automation is part of the natural 
progression of encoding information. This thesis considers the 
automation process to be a question of whether it is possible to 
automatically learn the encoding based on the behavior of the system 
to be described. A variety of representation languages are
considered, as are means for the learner to acquire a variety of 
types of data about the system in question. The learning process is 
abstracted as a learning problem in which the goal is to efficiently
collect sufficient information to identify some hidden concept using 
a particular language. The source of information about the concept is 
its relationship to some class of examples that is assumed to be [...]
[...] learning algorithms exist for two natural representation languages:
propositional Horn sentences and the CLASSIC description logic, a [...]

Computer automation of tasks is part of the natural progression of encoding information.
When the task becomes well understood and repetitive, placing the task under computer control 
becomes a possibility. Computers were once programmed by rewiring rather than with the use 
of a modern program, management of a limited memory was once handled by the application 
programmer rather than by the operating system, and efficient use of the computer's hardware 
was once obtained by assembly language programmers rather than through a compiler. In 
other areas, accounting moved from ledger books to spreadsheets, automobile fuel intake left 
the carburetor for computer-controlled fuel injection, and diagnosis and scheduling left the 
expert for the expert system. 

Current knowledge representation research has sought to provide schemes for encoding 
knowledge about how a given system behaves, with the goal being accuracy and utility. Can an 
accurate description be given with the representation language being used? Can the resulting 
representation be manipulated easily to answer questions about the system being described? To 
the extent that both questions can be answered affirmatively for some representation language 
$\mathcal{L}$, encoding information using $\mathcal{L}$ is well understood. Ideally, the goal of encoding knowledge is
not the task of encoding, but the product of the encoding task. If such encodings are required
for a variety of systems, then the question of automating the process of encoding arises.

This thesis considers this automation process to be a question of whether it is possible to 
automatically learn the encoding based on the behavior of the system to be described. A variety
of representation languages $\mathcal{L}$ are considered, as are a variety of means for the learner to acquire
a variety of types of data about the system in question. The learning process is abstracted as
a learning problem in which the goal is to efficiently collect sufficient information to identify
some hidden concept $C$ represented using the language $\mathcal{L}$. The source of information about $C$
is its relationship to some class of examples $X$ that is assumed to be reasonably available even
though $C$ itself is not. In addition to conjecturing guesses as to the identity of $C$, the learner
is permitted to ask how $C$ relates to individuals $x \in X$.

The goal of inquiry about this automation process is either to produce a learning algorithm
that efficiently automates the encoding of any representation that uses some useful representation
language $\mathcal{L}$ or to show that no such learning algorithm is possible. The centerpiece of this
thesis is that there do exist learning algorithms for two natural representation languages:
propositional Horn sentences and the Classic description logic. In addition, this thesis introduces a
new method - consistently ignorant teachers - of modeling uncertainty in the information being 
collected. The goal of this thesis is to demonstrate that, by careful consideration of the task at 
hand, the tools that have been developed in the field of computational learning theory can be 
used to automate the process of constructing the explanations required by real-world tasks in 
fields outside computational learning theory.

Chapter 1



Introduction to Computational 
Learning Theory 



Computational learning theory formalizes the notion of inferring a concept that correctly explains
observed data. As such, there must be a formalization of "concept," "correct," and
"data." These ideas are best formalized simultaneously.

Data exists in units called examples. A concept is a classifier; it divides the world of exam- 
ples into (generally) two groups - positive examples that exemplify the concept, and negative 
examples that do not exemplify the concept. Thus, a concept can be thought of as a boolean 
function to label examples as either positive or negative. 

To illustrate, we might take "duck," "giraffe," "elephant," "table," and "hubcap" as exam- 
ples. The concept name of an animal labels the first three as positive examples and the last 
two as negative examples; on the other hand, the concept two syllable names labels the second, 
fourth, and fifth as positive and the rest negative. 

In a learning problem we assume that a concept has been chosen and fixed, and we must 
deduce the concept based solely on the labels of the examples. Continuing the illustration, 
given that "duck" and "table" are the only positive examples in the list, after a moment of 
thought we might deduce that verb is the concept. Unfortunately, we might also deduce that 
either "duck" or "table" is the concept. 

This last observation brings us to the formalization of "correct." It is not enough for a
deduced concept to correctly label the list of known examples; it must also accurately predict
the label of examples that are as yet unseen. Thus, although verb and either "duck" or "table"
are identical in terms of the examples we know about, they are quite different when faced with the
new example "run."
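
The distinction between fitting the known examples and predicting unseen ones can be made concrete with a small sketch. The verb list below is a hand-made assumption standing in for real lexical knowledge; it is not part of the original text.

```python
# The five known examples from the running illustration.
known = ["duck", "giraffe", "elephant", "table", "hubcap"]

# Assumption: a tiny hand-made verb list, for illustration only.
VERBS = {"duck", "table", "run"}

def is_verb(word):
    """Candidate concept 1: the word can be used as a verb."""
    return word in VERBS

def duck_or_table(word):
    """Candidate concept 2: the word is either "duck" or "table"."""
    return word in {"duck", "table"}

# Both concepts assign the same label to every known example...
agree_on_known = all(is_verb(w) == duck_or_table(w) for w in known)

# ...but they disagree on the unseen example "run".
differ_on_run = is_verb("run") != duck_or_table("run")
```

Both candidate concepts are consistent with the labeled list, so only further examples (or further assumptions about the concept class) can separate them.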

To summarize, the learning problem assumes that some hidden concept is chosen and the 
labels for the examples are assigned according to that hidden concept. The learner is expected 
to deduce from a small set of examples a concept that accurately predicts the label of every
example (both seen and unseen), and the learner is expected to accomplish this task efficiently. 
In order to limit the possible explanations, we further assume that the hidden concept was 
selected from a set of concepts, called the concept class, that is known to the learner. We begin 
with the following formal learnability definition due to Angluin [4]. 

Definition 1 (Exact Learnability) Let $X$ be a set of examples, and let $\mathcal{C}$ denote a concept class
consisting of concepts expressed in some representation language $\mathcal{L}$ that are total boolean valued
functions over the domain $X$. For any $C \in \mathcal{C}$ let $size(C)$ denote the number of symbols needed
to represent $C$ using $\mathcal{L}$. Let $A$ be an algorithm designed with full knowledge of the preceding
items. Then $A$ is an exact learning algorithm if there exists a polynomial $p$ such that for any
choice of target $C_* \in \mathcal{C}$ (with $C_*$ unknown to $A$), $A$ outputs in time $p(size(C_*))$ a concept
$C' \in \mathcal{C}$ that is functionally equivalent to $C_*$ and makes at most $p(size(C_*))$ equivalence queries,
where an equivalence query is made by $A$ by selecting some $C \in \mathcal{C}$ and then being told that $C$
is functionally equivalent to $C_*$ or being provided with some counterexample $x \in X$ such that
$C_*(x) \neq C(x)$. If there exists such an $A$, the concept class $\mathcal{C}$ is said to be exactly learnable.
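
As a hedged illustration of the equivalence-query protocol, the sketch below learns a monotone conjunction (a monomial without negations) exactly. The concept class, the brute-force teacher, and all function names are assumptions chosen for illustration; this is not one of the algorithms developed in this thesis.

```python
from itertools import product

def evaluate(conj, x):
    """A monotone conjunction is a set of variable indices; true iff all are 1."""
    return all(x[i] for i in conj)

def equivalence_query(hyp, target, n):
    """Simulated teacher: None if hyp is equivalent to the target, otherwise a
    counterexample. Brute force over all 2^n assignments, for illustration only."""
    for x in product((0, 1), repeat=n):
        if evaluate(hyp, x) != evaluate(target, x):
            return x
    return None

def learn_monotone_conjunction(target, n):
    """Exact learner using equivalence queries only; makes at most n + 1 queries."""
    hyp = set(range(n))          # most specific hypothesis: all variables
    queries = 0
    while True:
        queries += 1
        cex = equivalence_query(hyp, target, n)
        if cex is None:
            return hyp, queries
        # hyp always contains the target's variables, so every counterexample
        # is a positive example that hyp rejects; drop the variables it sets to 0.
        hyp = {i for i in hyp if cex[i] == 1}

learned, used = learn_monotone_conjunction({0, 2}, n=4)
```

Each counterexample removes at least one variable from the hypothesis, so the number of equivalence queries is bounded by a polynomial in the number of variables, as the definition requires.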

A second learnability definition, due to Littlestone [72], captures the notion of ongoing
learning that eventually produces a correct concept but has made few mistakes along the way. 
Hence this is sometimes called mistake bounded learning or on-line learning. 

Definition 2 (On-line Learnability) If there exists a polynomial $p$ such that for any choice
of $C_* \in \mathcal{C}$ and any (perhaps infinite) sequence $S$ of examples, $A$ is called an on-line learning
algorithm if $A$, when presented the examples $x_i$ in $S$, predicts the label of each $x_i$ knowing only
$x_1, \ldots, x_{i-1}$ and their correct labeling according to $C_*$, mispredicts at most $p(size(C_*))$ of the
examples in the sequence, and takes time at most $p(size(C_*))$ correcting for a misprediction. If
such an $A$ exists, the concept class $\mathcal{C}$ is said to be on-line learnable.

When the evaluation problem (i.e., computing the value of $C(x)$ for any $C \in \mathcal{C}$ and any
$x \in X$) is solvable in polynomial time, it is easy to see that an on-line learning algorithm that
keeps a current $C \in \mathcal{C}$ to use in predicting the label of the next example in Littlestone's setting
exists if and only if an exact learning algorithm exists; the equivalence query essentially forces
the next prediction error to be presented immediately, whereas the sequence $S$ allows the $C$
used in the exact learning algorithm's equivalence query to be profitably used until the need
arises for correction [71].

A probabilistic variation on the definition of learnability is due to Valiant [98]. 

Definition 3 (PAC Learnability) If there exists a polynomial $p$ such that for any choice of
$C_* \in \mathcal{C}$, for any probability distribution $D$ over $X$, and given any $\epsilon > 0$ and $\delta > 0$, $A$ is called a
probably approximately correct (PAC) learning algorithm if $A$ sees at most $p(size(C_*), 1/\epsilon, 1/\delta)$
examples selected according to $D$ and outputs in time $p(size(C_*), 1/\epsilon, 1/\delta)$ a concept $C'$ such
that with probability at least $1 - \delta$, the probability that $C'(x) \neq C_*(x)$ on a point $x$ chosen
randomly according to $D$ is at most $\epsilon$. If such an $A$ exists, the concept class
$\mathcal{C}$ is said to be PAC learnable.

Littlestone [72] shows that a PAC learning algorithm for a given concept class exists if 
an on-line learning algorithm for that class exists. The idea used is that the on-line learning 
algorithm tests its current hypothesis by taking a polynomial number of random examples. If 
the hypothesis has significant error, one of the random examples chosen will demonstrate this 
error and can be used by the on-line algorithm as a misprediction. The on-line algorithm is 
guaranteed to make only a polynomial number of mistakes, so that the result is that throughout 
the run, at most a polynomial number of random examples are witnessed before, with high 
confidence, an accurate hypothesis is produced. 
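
The conversion just described can be sketched as follows. The on-line learner shown (an elimination algorithm for monotone conjunctions), the function names, and the batch size $m = \lceil \ln((M+1)/\delta)/\epsilon \rceil$ for mistake bound $M$ are assumptions made for this illustration; the batch size comes from a standard union-bound argument and is not a formula quoted from the text.

```python
import math
import random

class EliminationLearner:
    """On-line learner for monotone conjunctions over n variables.
    Assumes labels come from some monotone conjunction; makes at most n mistakes."""
    def __init__(self, n):
        self.hyp = set(range(n))
    def predict(self, x):
        return all(x[i] for i in self.hyp)
    def update(self, x, label):
        # Mistakes occur only on positive examples; drop variables set to 0.
        if label:
            self.hyp = {i for i in self.hyp if x[i]}

def online_to_pac(learner, draw, label, mistake_bound, eps, delta):
    """Test the current hypothesis on m random examples; on a misprediction,
    feed it to the on-line learner and restart the test."""
    m = math.ceil(math.log((mistake_bound + 1) / delta) / eps)
    while True:
        for _ in range(m):
            x = draw()
            if learner.predict(x) != label(x):
                learner.update(x, label(x))
                break                 # hypothesis changed: start a fresh test
        else:
            return learner            # hypothesis survived m random examples

# Hypothetical usage: target x1 AND x3 over 5 variables, uniform distribution.
random.seed(0)
target = {1, 3}
draw = lambda: tuple(random.randint(0, 1) for _ in range(5))
label = lambda x: all(x[i] for i in target)
result = online_to_pac(EliminationLearner(5), draw, label,
                       mistake_bound=5, eps=0.1, delta=0.05)
```

Because the on-line learner changes its hypothesis at most `mistake_bound` times, at most `mistake_bound + 1` hypotheses are ever tested, which is what keeps the total number of random examples polynomial.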

However, for concepts over examples from a continuous domain, PAC learnability does not
imply exact learnability; the concept class $\{[0, x) : 0 < x < 1\}$ is PAC learnable but not exactly
learnable. Even in a discrete domain, Blum [19], assuming the existence of one-way functions,
gives a hardness result for on-line learnability (and thus exact learnability as well) for a PAC
learnable class of functions. Thus, assuming that one-way functions exist, a positive learning
result in the exact model is stronger than a positive learning result in the PAC model for
boolean concept classes in which the evaluation problem is solvable in polynomial time.

These definitions implicitly assume the size of the representations of the examples is insignificant
compared to the size of the representation of the target concept. This is a valid
assumption for many problems in computational learning theory. The concept class is frequently
some set of boolean formulas, and the examples are frequently bit vectors representing
variable settings. For problems in which this assumption is not valid, such as when the concept
class is the set of deterministic finite automata and the examples are strings labeled according
to whether they are in the language accepted by the target DFA, the bound on the running time
of $A$ is often allowed to depend at most polynomially on the length of the longest example seen.
Care must be exercised in these settings to prevent a clever $A$ from forcing an exponentially
longer counterexample to be given in order to justify after the fact the amount of computation
time used in identifying the target [3]. In these settings, $A$ is required at each step in its run to
have used no more than time polynomial in the other parameters and the length of the longest
counterexample seen to that point in the run.¹

We will make a number of modifications to the above definitions to adapt them to a variety of
settings. Among these modifications will be adding queries of the sort suggested by Angluin [4];
most important among these queries for our purposes is the membership query, in which $A$ is
permitted to ask at most polynomially many questions of the form "What is $C_*(x)$?" for
any $x \in X$. We will often speak of the randomly chosen examples or answers to queries as
being provided by a teacher, an expert, or nature. More pragmatically, these queries may be
considered experiments designed by the learner.

The overriding goal of computational learning theory is to devise efficient learners for natu- 
ral concept classes or to show that no such learner exists. Natural concept classes are taken to 
be those coming into existence outside the field of computational learning theory because hav- 
ing arisen outside the field suggests that the concept class has interest apart from the question 
of its learnability. For example, the field of knowledge representation has produced a number 
of languages and systems [25, 28, 39, 77, 17], each developed to model phenomena from some 
domain. To the extent that such knowledge representation languages can accurately encode 
an explanation of the phenomena for which they were intended, the question naturally arises,



1 In the statement of our results the phrase "time polynomial in the length of the longest counterexample"
should be taken to mean that at each step in the run of the algorithm, at most time polynomial in the length of
the longest counterexample seen to that point has been used.

"Can the encoding of the explanation be automated, i.e., learned, simply through interaction
with the phenomenon?" Generally the answer is "No"; the richness of expression within these
natural concept classes thwarts known learning attacks. Because of this, approximations are
made to the natural concept classes and the learnability of these approximations is considered.
Even so, this thesis does devise exact learners for two natural kinds of classifiers: propositional
Horn sentences (a standard choice for encoding expert system knowledge) and the description
logic Classic (one of a collection of description logics finding many knowledge-based applications
[25]), the former using variable assignments as examples and the latter using other
Classic descriptions as examples.

We begin by summarizing results relevant to the new work presented here and then discuss 
some tools used to achieve these new results. 

1.1 State of the Art 

What natural classes of formulas are learnable? The most natural class of propositional boolean
formulas, namely, the class of all boolean formulas, was shown by Kearns and Valiant [66] not to
be learnable, given cryptographic evidence. Carefully note that this negative result relies on
the fact that no restriction was placed on the boolean formula representing the target boolean
function. However, if we demand that the target boolean function be represented as, say, a
k-CNF formula (a conjunction of disjuncts of at most k literals each, for some constant k), then learning
algorithms exist [98, 4]. The discrepancy is that some boolean functions are represented only by
exponentially long k-CNF formulas, so that the learning algorithm gets more time to learn the
same function. Thus, it is not necessarily the class of functions that is hard to learn; it can
be simply that the class of formulas permitted to represent those functions is hard to learn. In
other words, the choice of representation automatically provides a complexity parameter which
may offer the learner more time than some other representation for the same function.

The learnability question for two other natural classes, namely, CNF and DNF, was left open
by Valiant [98] when he introduced the distribution-free or PAC criterion for concept learning. 
The learnability of these two classes remains open, and efforts to close these questions have led 
researchers to investigate the learnability of a variety of restricted classes of boolean formulas. 




Algorithms exist² for learning monomials (pure conjunctive concepts) [73, 98], internal disjunctive
concepts [54], read-once formulas [9], monotone DNF formulas [4, 98], k-CNF and k-DNF
formulas for constant k [23, 73, 99], and k-term functions also for constant k [22]. There are also
algorithms for linearly separable formulas [73], decision lists [88], rank k decision trees (k constant)
[42], and decision trees [29], among others. Thus, a number of results have been obtained
for approximations to a natural class of formulas. One key result in this thesis is a learning
algorithm for a natural class of formulas, propositional Horn sentences, whose learnability was
left open by Angluin [2] when she presented a learning algorithm for the approximating class
of acyclic Horn sentences.

A great deal of work in the artificial intelligence and knowledge representation communities 
deals not with propositional but with first-order concepts; as such, there is a wealth of potential 
in studying the learnability of first-order concepts. Relatively little work has been done within
this framework, though interest is rising sharply [70, 81, 82, 15]. Also, Dzeroski et al. [41]
describe an algorithm that learns k-clause, determinate, function-free, first-order Horn clauses
with bounded depth variables by transforming the target into a propositional monotone k-term
DNF formula. This thesis gives a learning algorithm for another very restricted subclass of
recursive first-order Horn sentences. The primary distinction of this work from that of Dzeroski
et al. [41] is that the class studied is not determinate because functions can be nested to arbitrary
depth.³

First order learning results have been hard to achieve; to date, the positive learning results 
for first-order concepts have been for very restricted subclasses of first-order logic. For example, 
Page and Frisch [82] looked at atoms labeled according to whether they are entailed by a
particular kind of hidden first-order formula. Other related first-order results come from the
field of inductive logic programming where the target is a Prolog program. Cohen [35] gives a
PAC learning algorithm for function-free, two clause, linearly recursive, closed, ij-determinate
logic programs against a given background theory. Shapiro [95] describes a system for learning
Prolog programs in the limit (i.e., exact learning, but time is an unbounded resource) using

2 The results use a number of different protocols; in particular some of these results use membership queries. 
The protocol used to achieve a result stated here is explained in more detail when it becomes relevant later in 
the thesis. 

3 More precisely, the classes that result from flattening (rewriting to contain no functions) the concepts are
not determinate.

atoms entailed by the program to be learned. Another key result of this thesis is a learning
algorithm for a natural first-order class, Classic, which is used in the knowledge representation 
community. This learning result is inspired by other work on Classic [36, 33]. 

An interesting question arises when changing the setting of a learning problem from the 
propositional domain to the first-order domain: How are examples represented? In the propo- 
sitional case there is a natural choice for representing examples - bit vectors specifying variable 
settings. In the first order case, the models over which the semantics of first-order logic are 
defined are frequently infinite, rendering the phrase "polynomial running time" for an algorithm
seeing such examples meaningless. Instead of models labeled according to whether the target 
formula is satisfied, logical formulas labeled according to entailment by the target are often 
used. This choice of examples used in learning some particular, unknown body of knowledge 
suits the artificial intelligence and knowledge representation settings well in that the questions 
normally asked within those communities involve what logical implications can be efficiently 
deduced from a particular knowledge base. A third contribution of this thesis arises from car- 
rying the notion of entailment to the propositional domain to provide new results involving 
examples that are not bit vectors but propositional formulas. Relevant work here comes from 
the knowledge compilation literature [37, 52, 63, 92]. The results in this thesis include some 
sufficient conditions for learnability under entailment. 

Finally, for when the world does not operate in a well behaved way, such as when questions 
are answered not by an omniscient source but by a fallible rational source, using a representation 
closely related to Mitchell's version spaces [79], this thesis provides a model, called a consistently 
ignorant teacher, for the resultant uncertainty and considers a number of learning problems in 
such a setting. 

Interspersed among the positive learning results are a number of non-learnability results,
many of which show that some learnable class cannot be extended in a particular way while 
retaining learnability according to the same criteria. Other non-learnability results show that 
the types of queries used by a learning algorithm form a minimal set of queries needed to achieve 
learnability for the class. An itemized list of the contributions made by this thesis appears in 
the appendix.

1.2 Tools of the Trade



1.2.1 Hardness Results 

For relating the difficulty of predicting classes of formulas - and thus for obtaining non- 
learnability results assuming the hardness of learning some class of formulas - Pitt and War- 
muth [86] provide the powerful method known as prediction-preserving reductions. For example, 
one of the reductions given by Kearns, Li, Pitt and Valiant [65] shows that PAC learning mono- 
tone CNF and DNF formulas (without membership queries) is as hard, modulo a polynomial 
time transformation, as PAC learning general CNF and DNF formulas (without membership 
queries). Some of the negative results claimed in this thesis were obtained using this method, so
those negative results assume that learning CNF and DNF formulas is hard; in the event that 
learning algorithms do exist for these presumably negative results, the reductions given provide 
learning algorithms for CNF and DNF, something of great value indeed. At present, however, 
the learnability of CNF and DNF formulas remains the central open problem in computational 
learning theory. 

Other negative results are based on more common hardness assumptions such as NP ≠ RP
or the existence of one-way functions. There are also absolute (information theoretic) hardness
results that arise from adversary arguments; these results are obtained by choosing a target and
constructing a set of examples that reveals so little about the target that the learner is forced 
to request more than a polynomial number of these examples to accurately identify the class. 

1.2.2 Learnability Results and Prediction 

Reductions can also be used to provide positive learning results. Algorithms can be constructed 
that provide examples to an existing learning algorithm using a blind transformation of the 
examples. That is, the learning problem for concept class $\mathcal{C}$ is solved using a learning algorithm
for concept class $\mathcal{C}'$ and a polynomial time transformation on examples that does not require
knowledge of the chosen target from $\mathcal{C}$. If we require only that the learning algorithm be
accurate and not that the algorithm produce a concept from the class $\mathcal{C}$, we can use the concept
produced by the learning algorithm for $\mathcal{C}'$ together with the blind transformation on examples
as the classifier; the learning algorithm for $\mathcal{C}'$ is then known as a prediction algorithm for $\mathcal{C}$.



To illustrate the idea of a prediction algorithm, we present here our first new result. This 
result extends the work of Blum, Chalasani, and Jackson [20], who describe a learning algorithm
for the propositional class of disjoint k-multi-symmetric functions using membership and
equivalence queries.

Definition 4 Let $V = \{x_1, \ldots, x_n\}$ be a set of propositional variables. Then a disjoint k-multi-symmetric
function is a specification of a boolean valued k-ary function $f$ on tuples of natural
numbers together with a specification of $k$ disjoint subsets $S_1, \ldots, S_k$ of $V$. The value of $f$ is
obtained from a setting of the variables of $V$ by evaluating $f$ on the tuple $(n_1, \ldots, n_k)$ where $n_i$
is the number of variables in $S_i$ that are set true.
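
A minimal sketch of how such a function is evaluated may help. The function name, the particular subsets, and the choice of $f$ below are hypothetical and serve only to illustrate Definition 4.

```python
def evaluate_multi_symmetric(f, subsets, true_vars):
    """Evaluate a k-multi-symmetric function.
    f         : boolean valued function of k natural numbers
    subsets   : list of k sets of variable names (S_1, ..., S_k)
    true_vars : set of the variables that the assignment sets true"""
    counts = tuple(len(s & true_vars) for s in subsets)   # (n_1, ..., n_k)
    return f(*counts)

# Hypothetical instance with k = 2 over V = {x1, x2, x3, x4}.
subsets = [{"x1", "x2"}, {"x3", "x4"}]
f = lambda n1, n2: n1 > n2          # positive iff more of S_1 is true than of S_2

label = evaluate_multi_symmetric(f, subsets, {"x1", "x2", "x3"})   # counts are (2, 1)
```

Only the counts matter: any two assignments that set the same number of variables true in each subset receive the same label.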

This definition requires that the k subsets be disjoint. By removing the requirement that an 
algorithm construct a representation of this form, the class of functions in which these subsets 
need not be disjoint can be predicted by Angluin's algorithm for learning deterministic finite 
automata [3]. The construction of an on-line learning algorithm is as follows. 

Blum, Chalasani, and Jackson consider $f$ to be represented as a lookup table of $(n+1)^k$
entries, one entry for each possible tuple supplied to $f$. The value of the function $f$ depends
only on the count of the variables of each $S_i$ that are assigned true. If we take a variable
assignment and write down, in alphabetical order, the names of those variables assigned
true, then for each $i$ it is an easy task to construct an $(n+1)$-state DFA whose input symbols are
the $n$ variable names and whose states represent the number of variables in $S_i$ assigned true.
It then follows easily that a DFA having the $n$ variable names as input symbols and having at
most $(n+1)^k$ states can be constructed such that the states of the DFA represent the number of
variables assigned true in each of any particular $k$ such subsets $S_i$. By designating as accepting states
those states corresponding to those tuples that $f$ labels positive, this DFA accepts
exactly the variable assignments that $f$ labels positive, and the DFA has essentially the same
size as the table representing $f$. We will use this DFA representation of the (not necessarily
disjoint) k-multi-symmetric functions, and we will use Angluin's DFA learning algorithm to
learn the DFA representation.
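
The counting DFA described above can be sketched directly. The names `make_dfa` and `accepts`, and the capping of counters at $n$ so that strings with repeated symbols stay inside the state set, are assumptions of this illustration rather than details taken from the text.

```python
from itertools import product

def make_dfa(variables, subsets, f):
    """Build the counting DFA as (start, delta, accepting).
    States are k-tuples of counts. Reading a variable name increments the
    counter of every subset that contains it (subsets may overlap)."""
    n, k = len(variables), len(subsets)
    states = list(product(range(n + 1), repeat=k))       # (n+1)^k states
    def step(state, sym):
        # Counters are capped at n (an assumption, to keep delta total).
        return tuple(min(c + (sym in s), n) for c, s in zip(state, subsets))
    delta = {(q, a): step(q, a) for q in states for a in variables}
    accepting = {q for q in states if f(*q)}
    return (0,) * k, delta, accepting

def accepts(dfa, string):
    """Run the DFA on a list of variable names."""
    state, delta, accepting = dfa
    for sym in string:
        state = delta[(state, sym)]
    return state in accepting

# Hypothetical instance with overlapping subsets.
variables = ["x1", "x2", "x3"]
subsets = [{"x1", "x2"}, {"x2", "x3"}]
f = lambda a, b: a == b
dfa = make_dfa(variables, subsets, f)
ok = accepts(dfa, ["x1", "x3"])      # counts (1, 1)
```

On an alphabetized string without repeats, the state reached is exactly the tuple of counts for the corresponding assignment, so acceptance agrees with the label assigned by $f$.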

We will construct a sequence of examples for the DFA learning algorithm to use from the
sequence of examples presented for $f$. Occasionally, we will insert some of our own examples
that serve to constrain the search facing the DFA learning algorithm; it is important to note
that knowledge about $f$ is used neither in the transformation of examples for $f$ into examples
for the DFA learning algorithm, nor in the selection of the extra examples inserted into the
sequence for the DFA.

Because the examples given to the DFA algorithm are supposed to represent an alphabetized
list of variables set true, when the DFA learning algorithm wishes to test some DFA $A$ for
accuracy, we check to see that the language $L(A)$ it accepts contains only strings in which the
symbols appear in alphabetical order and no symbol appears twice. To accomplish this, we
make $O(n^2)$ checks that

$$L(A) \cap (V - \{x_j\})^* \, x_j \, (V - \{x_i\})^* \, x_i \, V^*$$

is the empty language for every pair of symbols $x_i$ and $x_j$ in $V$ with $i \le j$. Clearly, given $x_i$ and
$x_j$, each of these tests can be performed in polynomial time because each test can be performed
by intersecting the 3-state DFA $M(i, j)$ shown in Figure 1.1 with $A$ and checking the resulting
language for emptiness. Equally clearly, a string in any of the non-empty languages is efficiently
found.

Observe that $M(i, i)$ accepts exactly those strings in which $x_i$ appears more than once, and
$M(i, j)$ accepts exactly those strings in which $x_j$ occurs before $x_i$. If any language $L(A) \cap L(M(i, j))$
is non-empty, present any string in any of these languages as the next example in the sequence
being presented to the DFA learning algorithm.
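
One way to carry out these emptiness tests is a breadth-first search over the product of $A$ with the 3-state DFA $M(i, j)$. The sketch below is an illustration under assumptions: the DFA is given as a start state, a transition dictionary, and a set of accepting states, and the function names are hypothetical.

```python
from collections import deque

def m_step(state, sym, xi, xj):
    """Transition of the 3-state DFA M(i, j); state 2 is the only accepting state.
    State 0 waits for x_j, state 1 then waits for x_i, state 2 loops on everything."""
    if state == 0:
        return 1 if sym == xj else 0
    if state == 1:
        return 2 if sym == xi else 1
    return 2

def bad_string(start, delta, accepting, alphabet, xi, xj):
    """Search the product of A and M(i, j) breadth first. Returns a shortest
    string in the intersection of the two languages, or None if it is empty."""
    seen = {(start, 0)}
    queue = deque([(start, 0, [])])
    while queue:
        q, m, word = queue.popleft()
        if q in accepting and m == 2:
            return word
        for a in alphabet:
            nxt = (delta[(q, a)], m_step(m, a, xi, xj))
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt[0], nxt[1], word + [a]))
    return None

def find_malformed(start, delta, accepting, alphabet):
    """Try every pair i <= j, with the alphabet listed in alphabetical order.
    The case i == j catches repeated symbols, as M(i, i) does."""
    for i, xi in enumerate(alphabet):
        for xj in alphabet[i:]:
            w = bad_string(start, delta, accepting, alphabet, xi, xj)
            if w is not None:
                return w
    return None

# Hypothetical hypothesis DFA that accepts every string over {a, b}.
witness = find_malformed(0, {(0, "a"): 0, (0, "b"): 0}, {0}, ["a", "b"])
```

Each search visits at most three times the number of states of $A$, so all $O(n^2)$ tests together take polynomial time, and a non-empty intersection yields a witness string that can be handed to the DFA learner as a negative example.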

[Figure 1.1: DFA schema for testing example form. Recoverable state loop labels: $V - \{x_j\}$, $V - \{x_i\}$, $V$.]

If all of these languages are empty, then continue using $A$ against the sequence of examples
being presented. For each example, alphabetize the list of variables set true, preserve the 
example label, and hand the result as the next example in the sequence to be presented to the 
DFA learning algorithm. 

The matter is simpler for membership queries. When the DFA learning algorithm poses
a membership query, answer the query "no" if the string has any repeated symbols or if the
string is not alphabetized. Otherwise, pose a membership query formed by setting exactly those
variables in the string to true and respond to the DFA query with the answer received from a
variable assignment membership query, which we have at our disposal.
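
The translation of membership queries is short enough to sketch in full. Here `assignment_oracle` is a hypothetical stand-in for the variable assignment membership query available to the learner, and the query string is assumed to use only symbols from `variables`.

```python
def answer_dfa_membership_query(string, variables, assignment_oracle):
    """Answer a membership query posed by the DFA learner.
    string            : list of variable names (the query string)
    variables         : list of all variable names in alphabetical order
    assignment_oracle : hypothetical oracle taking the set of variables set
                        true and returning the label of that assignment"""
    if len(set(string)) != len(string):
        return False                                  # repeated symbol
    if list(string) != sorted(string, key=variables.index):
        return False                                  # not alphabetized
    return assignment_oracle(frozenset(string))       # well-formed: ask the real oracle

# Hypothetical oracle: positive iff an even number of variables is true.
oracle = lambda true_vars: len(true_vars) % 2 == 0
variables = ["x1", "x2", "x3"]
a1 = answer_dfa_membership_query(["x2", "x1"], variables, oracle)
a2 = answer_dfa_membership_query(["x1", "x3"], variables, oracle)
```

No knowledge of the target is used in the translation itself; malformed strings are rejected outright and every well-formed string costs exactly one variable assignment membership query.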

Because Angluin's DFA learning algorithm makes at most a polynomial number of mistakes 
regardless of the sequence it sees, we have the following result. 

Theorem 5 The class of (not necessarily disjoint) k-multi-symmetric functions is predictable 
by the on-line version of Angluin's DFA learning algorithm. 

Proof: The above discussion shows that for any (not necessarily disjoint) k-multi-symmetric
function $f$, there is a DFA $A_f$ having size equivalent to the size of the table representation $T_f$ of
$f$. The discussion gives a polynomial time one-to-one mapping $s$ between the set $X$ of variable
assignment examples and the set $X'$ of alphabetized strings over the variable names in which
no symbol occurs twice. It is then shown that $T_f$ labels $e \in X$ a positive example if and only
if $s(e)$ is accepted by $A_f$. Since the number of prediction mistakes made by the DFA learning
algorithm is at most polynomial in the size of $A_f$, the number of prediction mistakes is also at
most polynomial in the size of $T_f$. Thus the above construction produces an on-line learning
algorithm for the class of (not necessarily disjoint) k-multi-symmetric functions. □

1.2.3 Chernoff Bounds 

One standard tool used in estimating the likelihood of certain properties of a probability distribution
in the PAC setting is Chernoff bounds. One formulation of these bounds, stated below,
can be found in [13].

Fact 6 Let $LE(p, m, (1-\alpha)pm)$ denote the probability that in m independent trials an event
having probability of occurrence p is witnessed at most $(1-\alpha)pm$ times, and let
$GE(p, m, (1+\alpha)pm)$ denote the probability that in m independent trials an event having
probability of occurrence p is witnessed at least $(1+\alpha)pm$ times. Then

• $LE(p, m, (1-\alpha)pm) \le e^{-\alpha^2 mp/2}$

• $GE(p, m, (1+\alpha)pm) \le e^{-\alpha^2 mp/3}$
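
As a sanity check, the two bounds can be compared against exact binomial tail probabilities. The sketch below is illustrative only; the function names are hypothetical, and it assumes the deviation parameter lies between 0 and 1, the usual range for bounds of this form (the statement above does not spell this out).

```python
import math


def le_exact(p: float, m: int, r: float) -> float:
    """Exact probability of at most r occurrences in m independent trials."""
    return sum(math.comb(m, k) * p**k * (1 - p) ** (m - k)
               for k in range(0, math.floor(r) + 1))


def ge_exact(p: float, m: int, r: float) -> float:
    """Exact probability of at least r occurrences in m independent trials."""
    return sum(math.comb(m, k) * p**k * (1 - p) ** (m - k)
               for k in range(math.ceil(r), m + 1))


def le_bound(p: float, m: int, alpha: float) -> float:
    """Chernoff upper bound on the lower tail."""
    return math.exp(-alpha**2 * m * p / 2)


def ge_bound(p: float, m: int, alpha: float) -> float:
    """Chernoff upper bound on the upper tail."""
    return math.exp(-alpha**2 * m * p / 3)


# Illustrative parameters (assumed, not taken from the text).
p, m, alpha = 0.5, 100, 0.2
lower_ok = le_exact(p, m, (1 - alpha) * p * m) <= le_bound(p, m, alpha)
upper_ok = ge_exact(p, m, (1 + alpha) * p * m) <= ge_bound(p, m, alpha)
```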

1.2.4 Importance of Membership Queries

Membership queries can make a difference - for example, monotone DNF and CNF formulas 
are PAC learnable if membership queries are available [98, 4], whereas their status is open 
without membership queries. Another example is provided by read-once formulas. These are 
PAC learnable in polynomial time if membership queries are available [9]. However, by another 
reduction of Kearns, Li, Pitt, and Valiant [65], the PAC learnability of read-once formulas 
is equivalent to that of general Boolean formulas in the absence of membership queries, and 
therefore as hard as certain apparently hard cryptographic problems, by the results of Kearns
and Valiant [66]. In the exact model, read-once formulas are not learnable with equivalence
queries alone [9] but are exactly learnable with membership queries.

How much are membership queries likely to help with learning general Boolean formulas, or 
general CNF and DNF formulas? Angluin and Kharitonov [10] give cryptographic evidence that 
the answer in both cases is "not much." For general Boolean formulas there is cryptographic
evidence of the same sort as given by Kearns and Valiant that they are not PAC learnable in
polynomial time even if membership queries are available. For general CNF and DNF formulas
the situation is more complicated, but, in effect, there is cryptographic evidence that either
general CNF formulas will be PAC learnable in polynomial time without membership queries,
or they won't be PAC learnable in polynomial time even with membership queries - that is,
the membership queries "won't help" in the case of general CNF and DNF formulas.

Because the availability of membership queries may change the answer to the learnability 
question for some learning problems, the results claimed in this thesis include results about the 
necessity (or lack thereof) of membership queries. 

1.3 Content

Each of chapters 2 through 7 of this thesis (except for chapter 4) represents a published work. 
Chapters 2 through 5 provide propositional results, beginning with the most intuitive model. 
Chapters 6 and 7 provide first-order results, chapter 7 discussing the less familiar concept class. 
The discussion of the significance of the results of a particular chapter appears at the end of
that chapter.



Chapter 2 represents joint work with Dana Angluin and Lenny Pitt and appears as [8]. 
This chapter gives the most detailed description of the relationship of variable assignments
to propositional formulas. This chapter also carefully defines the class of propositional Horn
formulas. This level of detail is omitted in subsequent chapters. The results presented in
Chapter 2 include two learning algorithms for the important class of propositional Horn formulas;
also presented are hardness results for the related class of 2-quasi Horn sentences. Sections 2.6
and 2.7 include material left open in [8].

Chapter 3 represents joint work with Lenny Pitt, which appears as [46]. This chapter builds 
directly on the work in Chapter 2, but trades the familiar variable assignment examples for 
clauses labeled according to logical entailment by a propositional Horn target. This chapter 
also presents two learning algorithms for the class of propositional Horn sentences - one a direct 
learning algorithm, and the other a reduction to either learning algorithm from Chapter 2. This 
chapter also presents hardness results for attempting to disallow membership queries. 

Chapter 4, also representing joint work with Lenny Pitt and also extending the work of 
Chapter 2, presents a learning protocol motivated by the desire to learn an efficiently applicable 
representation of the target. This desire contrasts with the common desire to learn a small 
representation of the target. The protocol assumes the time taken by an expert to answer 
a membership query is important, but does not assume that the expert is able to articulate 
his representation of the target function. Sufficient conditions for learning under this protocol 
are given and are applied to a class of formulas defined by Boros et al. [27], a class properly 
containing Horn sentences and 2-CNF. 

Chapter 5 represents joint work with Sally Goldman, Nina Mishra, and Lenny Pitt, and 
appears as [44]. A new learning model is presented in this chapter. Here the teacher is assumed 
to be rational but not necessarily omniscient. Viewed another way, the teacher may have
several competing models of the world and presents categorical answers to the learner only in 
those cases where the teacher's competing models agree. The learner is expected to acquire the 
teacher's models of the world. This chapter presents learning algorithms for a variety of concept 
classes, including a concept class related to the union of boxes in d-dimensional Euclidean space.
A hardness result is given for the case in which the teacher has several competing Horn models
of the world.

Chapter 6 represents joint work with C. David Page and appears as [43]. This chapter
presents a learning algorithm that uses no membership queries, but learns a very restricted 
first-order class of recursive Horn sentences. Here the learner sees first-order clauses labeled 
according to entailment by the target. This chapter also presents some discussion on the effect 
of relaxing some of the restrictions on the formulas in the class; a hardness result is presented 
when the restriction to unary functions is removed. A secondary result in this chapter is a new 
characterization of the regular languages. 

Chapter 7 represents joint work with Lenny Pitt and appears as [45]. This chapter discusses
a surprising positive learnability result for the (first-order) description logic CLASSIC. Here the
learner collects examples labeled according to logical consequence of the target. It is shown in
this chapter that arbitrary CLASSIC descriptions can be learned, even when the rate of malicious
noise in the answer to membership queries is near 1/2.

Chapter 2

Propositional Horn Sentences and 
Membership by Satisfaction 



We (joint work with Angluin and Pitt) show that the class of Horn sentences is exactly learnable 
using membership and equivalence queries from a standard teacher. This set of queries is 
minimal in that Angluin [4, 6] has shown that propositional Horn sentences are not exactly 
learnable from a standard teacher from membership or equivalence queries alone. Observe 
that Horn sentences are "almost monotone" in the sense that each clause contains at most one 
positive literal. As a negative result, we show via prediction preserving reduction that the class 
of 2-quasi Horn sentences - conjunctions of clauses, each clause containing at most two positive 
literals - is no easier to PAC learn using membership queries than CNF. We also show, using
an insightful observation by Vijay Raghavan [87], that 2-quasi Horn sentences are no easier to
exactly learn with equivalence and membership queries from a standard teacher than CNF.
Thus in some sense, the amount of "non-monotonicity" permitted in Horn formulas appears
to be near the edge of learnability. Further, it is interesting to note that the kinds of questions
the learner is permitted to ask when learning Horn sentences are quite reasonable, because the
teacher can answer them in polynomial time.

Compared to learning algorithms using membership queries to learn other concept classes,
our Horn learning algorithm might be termed "lazy." In other algorithms, membership queries
are commonly applied aggressively to a counterexample to construct a new counterexample that
has some minimality property (say, a minimal number of variables assigned true). In contrast,
our algorithm works by keeping a sequence S of negative examples, and focuses its attention on
explaining why those examples are negative. An example is negative only if it falsifies some clause of
the target, and a Horn clause is falsified only if each of the variables in the antecedent of the
clause is set to true and the variable in the consequent is set to false. Thus the algorithm
explains a negative example in S by assuming the target contains a clause whose antecedent
consists of exactly the variables set true by the example and whose consequent (if any) is
one of the variables the example makes false. Clearly the antecedent might not contain
all the variables set true by the example, so, in effect, the algorithm waits for another negative
example that is explained by the same clause and throws out all the variables these two
negative examples do not have in common; further, this is the only explanation that is updated.
This relaxed, data-driven nature of the algorithm seems crucial; a number of more aggressive
approaches that attempt to throw out as many variables as possible, or that attempt to update
several explanations in response to new information, were foiled when a clever adversary chose
the counterexamples.

Because of the nice computational structure of Horn sentences, both the membership and 
the equivalence queries can be answered in polynomial time given knowledge of the target. 
Thus as a benefit, the learning algorithm can be used to reduce a given Horn sentence to 
its logical equivalent having the minimum number of distinct antecedents in polynomial time. 
To our knowledge, there did not exist an efficient procedure for finding a minimum sized Horn 
sentence equivalent to a given sentence, nor were we aware of canonical forms for Horn sentences. 
A pleasing property of our algorithm is that it produces guesses as to the identity of the target 
having a monotonically increasing number of distinct antecedents; thus the algorithm is capable 
of determining whether the function it is being asked to learn can be represented as a Horn 
sentence of a given size. This property is useful in trying to approximate a (not necessarily
Horn) concept by a small Horn sentence.

2.1 The Problem 

Let $V = \{v_1, \ldots, v_n\}$ be a set of Boolean variables. A literal is either a variable $v_i$ or its negation
$\neg v_i$. A clause over variable set V is a disjunction of literals. A Horn clause is a clause in which
at most one literal is unnegated. A Horn sentence is a conjunction of Horn clauses. The class
of Horn sentences over variable set V is a proper subclass of the class of Boolean formulas over
V.

Let $H_*$ denote the target Horn sentence to be learned. The main result of this chapter is
that propositional Horn sentences are exactly learnable using variable assignments as examples,
provided membership (and equivalence) queries are available. The algorithm runs in time¹
$\tilde{O}(m^2 n^2)$, making $O(mn)$ equivalence queries and $O(m^2 n)$ membership queries, where m is the
number of clauses, and n is the number of variables of $H_*$.

It is interesting to note that both types of queries are necessary for learning Horn sentences. 
Angluin [4] shows that membership queries alone are insufficient for polynomial-time learning, 
and, implicitly in [6], she proves that equivalence queries alone are also insufficient. 

A similar result for learning monotone formulas in disjunctive normal form (DNF) has been 
given [4, 98]. The dual of the class of Horn sentences is the class of "almost monotone" DNF
formulas: a disjunct of terms, where each term is a conjunct of literals, at most one of which
is negated. Since our algorithm is easily modified to handle the dual class, it extends the results
in [4, 98] by allowing a small amount of nonmonotonicity. (Later, we indicate why allowing
more nonmonotonicity would yield a difficult problem.) Horn sentences are an interesting 
nontrivial subclass of CNF (dual: DNF) formulas, the learnability of which remains a central 
open problem for the distribution-independent ("PAC learning") model of Valiant [98]. The 
research presented here also improves the results in [2], where the class of Horn sentences is 
shown to be learnable by an algorithm that uses equivalence queries that return Horn clauses as 
counterexamples and "derivation queries" — a type of query that is significantly more powerful 
than a membership query. 

By modifying the algorithm presented here in a relatively straightforward way [4] we could
obtain an algorithm that learns the class of Horn sentences using randomly generated examples
as in the PAC learning model, provided that the algorithm is additionally allowed to make
membership queries. Similarly, the algorithm presented here could be used in an on-line setting
in which the learning algorithm is to classify each of a succession of examples, and the algorithm
is told whether its classification is correct or incorrect before receiving each next example. The
resulting on-line algorithm makes membership queries (excluding the examples to be classified)
but not equivalence queries, and is guaranteed to make at most a polynomial number of errors
of classification regardless of the sequence of examples [4].

¹ The $\tilde{O}(\cdot)$, or "soft-O", notation is similar to the usual $O(\cdot)$ notation except that $\tilde{O}(\cdot)$ ignores
logarithmic factors.

Note that because the problem of determining whether two Horn sentences are equivalent 
(and producing a counterexample if they are not) is solvable in polynomial time, the oracle in 
our learning protocol could be replaced by a teacher with polynomially bounded computational 
resources. 

The remainder of this chapter is organized as follows. Section 2.2 gives basic definitions, 
notation, and lemmas that will be used throughout. In Section 2.3 we describe the algorithm 
and give an example run. Section 2.4 provides a correctness proof and an analysis of the 
time and query complexity of the algorithm. In Section 2.5 we present a modified version of 
the algorithm which satisfies smaller time and query bounds. We conclude in Section 2.8 by 
discussing some related and open problems. 

2.2 Preliminaries 

It is often easier to discuss the satisfaction or falsification of a Horn clause when that clause
is represented as an implication. To expedite the discussion we will implicitly assume that all 
Horn clauses are represented as implications. This necessitates the introduction of two logical 
constants. 

Definition 7 The logical constant "true" is represented by T and the logical constant "false" 
is represented by F. 

Next, we introduce notation that will enable us to dissect Horn clauses and discuss the
relationships between them and examples. First, recall the identity
$z \vee \bigvee_{i=1}^{k} \neg v_i \equiv (\bigwedge_{i=1}^{k} v_i) \Rightarrow z$,
where "$\Rightarrow$" is the logical connective for implication and "$\equiv$" is a metasymbol indicating logical
equivalence. Now, taking $\bigwedge_{i=1}^{0} v_i = T$ (the empty conjunction evaluates to true) and adopting
the convention that we write $(\bigwedge_{i=1}^{k} v_i) \Rightarrow F$ when there are no unnegated variables in the Horn
clause, we have the following definitions.
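
As a small worked instance of this rewriting (the variable names a, b, and c are chosen here only for illustration), a clause with one unnegated literal, a clause with no unnegated literal, and a clause with no negated literal become the following implications:

```latex
\begin{align*}
c \vee \neg a \vee \neg b &\;\equiv\; (a \wedge b) \Rightarrow c, \\
\neg a \vee \neg b        &\;\equiv\; (a \wedge b) \Rightarrow F, \\
c                         &\;\equiv\; T \Rightarrow c.
\end{align*}
```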

Definition 8 Let H be a Horn sentence over V. An example is any assignment $x : V \to \{T, F\}$.
A positive (respectively, negative) example for H is an assignment x such that H
evaluates to true (respectively, false) when each variable v in H is replaced by x(v).

Definition 9 Let x be an example; then true(x) is the set of variables assigned the value "true"
by x and false(x) is the set of variables assigned the value "false" by x.

By convention, $T \in true(x)$ and $F \in false(x)$.

Definition 10 Let C be a Horn clause. Then antecedent(C) is the set of variables that occur 
negated in C. If C contains an unnegated variable z, then consequent(C) is just z. Otherwise, 
C contains only negated variables and consequent(C) is F. 

We now describe the relationships that may exist between an example and a Horn clause. 

Definition 11 An example x is said to cover a Horn clause C if $antecedent(C) \subseteq true(x)$. We
say that x does not cover C if $antecedent(C) \not\subseteq true(x)$. The example x is said to violate the
Horn clause C if x covers C and $consequent(C) \in false(x)$.

Notice that if x violates C then x must cover C, but that the converse does not necessarily 
hold. 
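
These definitions translate directly into code. The following is a minimal sketch under assumed representations that are not part of the text: an example is stored as the set true(x), a Horn clause as a pair (antecedent, consequent), and the constant F as the string "F", which never belongs to true(x).

```python
from typing import FrozenSet, Iterable, Tuple

F = "F"                                  # the logical constant "false"
Example = FrozenSet[str]                 # true(x); everything else is false
Clause = Tuple[FrozenSet[str], str]      # (antecedent(C), consequent(C))


def covers(x: Example, clause: Clause) -> bool:
    """x covers C if antecedent(C) is a subset of true(x)."""
    antecedent, _ = clause
    return antecedent <= x


def violates(x: Example, clause: Clause) -> bool:
    """x violates C if x covers C and consequent(C) is in false(x)."""
    _, consequent = clause
    return covers(x, clause) and consequent not in x


def is_negative(x: Example, sentence: Iterable[Clause]) -> bool:
    """x is a negative example exactly when it violates some clause."""
    return any(violates(x, c) for c in sentence)
```

Storing only true(x) keeps intersection of examples a plain set intersection, which is convenient for the operations defined later in this section.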

It will be more convenient throughout the rest of the paper to consider a Horn sentence as 
a set of Horn clauses, representing the conjunction of the clauses. 

Our first observation is trivial, but it is helpful to state it formally. 

Proposition 1 If x is a negative example for the Horn sentence H, then x violates some clause
of H.

We next define the $\cap$ operation for examples.

Definition 12 Let x and s be two examples; then $x \cap s$ is defined to be the example z such that
$true(z) = true(x) \cap true(s)$.

Note that this implies that $false(x \cap s)$ is $false(x) \cup false(s)$.

Lemma 13 Let x and s be examples. If x violates C and s covers C, then $x \cap s$ violates C.

Proof: If s covers C then $antecedent(C) \subseteq true(s)$. Also, if x violates C, then
$antecedent(C) \subseteq true(x)$ and $consequent(C) \in false(x)$. Thus
$antecedent(C) \subseteq true(s) \cap true(x) = true(s \cap x)$
and $consequent(C) \in false(x) \subseteq false(s) \cup false(x) = false(s \cap x)$. Thus, $s \cap x$ violates C. □

Corollary 14 Let x and s be examples. If x violates C and s violates C, then $x \cap s$ violates C.

Proof: Apply Lemma 13 after noting that if s violates C then it also covers C. □

Lemma 15 If x does not cover C, then for any example s, $x \cap s$ does not violate C.

Proof: If $antecedent(C) \not\subseteq true(x)$ then $antecedent(C) \not\subseteq true(x) \cap true(s) = true(x \cap s)$. Thus
$x \cap s$ does not violate C. □

Lemma 16 If $x \cap s$ violates C, then at least one of x and s violates C.

Proof: Observe that $x \cap s$ violates C, and thus $consequent(C) \in false(x \cap s) = false(x) \cup false(s)$.
Therefore, consequent(C) is an element of at least one of false(x) and false(s). But since $x \cap s$
violates C, we also have $antecedent(C) \subseteq true(x)$ and $antecedent(C) \subseteq true(s)$. Thus at least
one of x and s must violate C. □
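
The three lemmas are easy to confirm exhaustively on a small variable set. The sketch below is only a brute-force check, not a proof; it assumes three variables named a, b, c, stores an example as its set of true variables, and stores a clause as an (antecedent, consequent) pair with "F" standing for the constant false.

```python
from itertools import combinations, product

VARS = ("a", "b", "c")   # assumed toy variable set
F = "F"


def subsets(items):
    """All subsets of items, as frozensets."""
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


EXAMPLES = list(subsets(VARS))
# Every Horn clause over VARS whose consequent is not in its own antecedent.
CLAUSES = [(ant, z) for ant in subsets(VARS) for z in (*VARS, F) if z not in ant]


def covers(x, clause):
    return clause[0] <= x


def violates(x, clause):
    return covers(x, clause) and clause[1] not in x


def lemmas_hold() -> bool:
    """Check Lemmas 13, 15 and 16 for every pair of examples and every clause."""
    for x, s, c in product(EXAMPLES, EXAMPLES, CLAUSES):
        z = x & s  # the example x ∩ s
        if violates(x, c) and covers(s, c) and not violates(z, c):
            return False  # would contradict Lemma 13
        if not covers(x, c) and violates(z, c):
            return False  # would contradict Lemma 15
        if violates(z, c) and not (violates(x, c) or violates(s, c)):
            return False  # would contradict Lemma 16
    return True
```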



2.3 The Algorithm

The ideas behind our algorithm may be understood by considering the problems that arise when
we attempt to employ more straightforward approaches. After our algorithm is motivated in
this manner, an example run is given. The correctness of the algorithm is demonstrated in the
next section; in Section 2.5 we present a more efficient version of the same algorithm.

Let $H_*$ be the target Horn sentence with respect to which equivalence and membership
queries are answered. The algorithm is based on the following ideas. Every negative example
x violates some clause C of $H_*$. From x we would like to add the clause C to our
current hypothesis, but we cannot exactly determine C from x alone. We know however that
$antecedent(C) \subseteq true(x)$, and $consequent(C) \in false(x)$. Thus one approach would be to add
to our current hypothesis H all elements of the set

$clauses(x) = \{ (\bigwedge_{v \in true(x)} v) \Rightarrow z : z \in false(x) \}$

whenever a new negative counterexample x is obtained. Each clause in this set is a possible
explanation as to why x is a negative example, and each clause guarantees that in the future x
will be classified as a negative example. Moreover, at least one of these clauses is subsumed (i.e.,
logically implied) by some clause of $H_*$. However, in addition to adding a clause with the
"correct" consequent, we may be adding several clauses with the "wrong" consequent. Fortunately,
this problem is not of concern because any such clause that is not logically implied by the target
formula $H_*$ will eventually be discovered when a positive counterexample is produced that does
not satisfy the clause. At this point, at least one extraneous clause will be eliminated.²

Unfortunately, a simple scenario shows that this straightforward approach is inadequate
because we can force it to add exponentially many insufficiently strong clauses. Suppose $H_*$
is defined over the variable set $V = \{a, b_1, b_2, \ldots, b_n\}$ and in fact $H_*$ is just a single clause, C,
which is

$a \Rightarrow F$.

Now any example in which a is set to T is a negative example. In particular, the example
with a set to T and the variables $\{b_i\}_{i\ \mathrm{even}}$ set to T is a negative example. Among the clauses
generated from this example is

$(a \wedge b_2 \wedge b_4 \wedge \cdots \wedge b_n) \Rightarrow F$

Now we see a negative example which is identical to the previous negative example except that
$b_1$ is T instead of $b_2$. This example does not violate the first clause that we generated so we are
obligated to exclude this negative example by generating clauses for it. Among these clauses is

$(a \wedge b_1 \wedge b_4 \wedge b_6 \wedge \cdots \wedge b_n) \Rightarrow F$

which, like its predecessor, is logically entailed by $H_*$. The difficulty is now clear: there are
exponentially many negative examples with a set to T and exactly half of the variables $\{b_i\}$ set
to T, and we are forced to exclude each one with its own clause. We have observed above that
by including clauses(x) in the current hypothesis H upon seeing a negative counterexample
x, some "good" clause is added to H, namely, the clause $(\bigwedge_{v \in true(x)} v) \Rightarrow consequent(C)$, where
C is some clause of $H_*$ that x violates. However, the scenario above demonstrates that the
problem with the straightforward approach is that the antecedents of the clauses generated
by clauses(x) are too long, so this "good" clause is less restrictive than C because its antecedent
is more restrictive. Thus, the negative examples that fail to satisfy the (good) added clause
may be only a small fraction of those that fail to satisfy C, and the clause added to H is only
an "approximation" of C. Consequently, very many such approximations to the target clause
C are generated by the examples.

² Step 8 of the learning algorithm of Figure 2.1 incorporates this idea of discarding extraneous clauses when a
positive counterexample is produced.
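
The counting argument in this scenario can be checked mechanically for a small n. The sketch below is illustrative; the value n = 8 and the function names are assumptions. It enumerates the negative examples that set a and exactly half of the b variables to T, and confirms that the full-antecedent clause built from any one of them excludes none of the others.

```python
from itertools import combinations
from math import comb


def half_true_negatives(n: int):
    """Negative examples for the target a => F with exactly n/2 of b_1..b_n true.

    Each example is represented by its set of true variables.
    """
    bs = [f"b{i}" for i in range(1, n + 1)]
    return [frozenset(("a",) + chosen) for chosen in combinations(bs, n // 2)]


def excluded_by(x, examples):
    """Examples violating the clause (AND of true(x)) => F, i.e. those covering it."""
    return [y for y in examples if x <= y]


n = 8  # assumed small instance
examples = half_true_negatives(n)
expected_count = comb(n, n // 2)
# Each clause built from one example excludes only that example within this family.
each_needs_own_clause = all(excluded_by(x, examples) == [x] for x in examples)
```

Since the number of such examples is the central binomial coefficient, the hypothesis built this way grows exponentially in n even though the target is a single clause.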

To counter this problem, a second approach might be to find smaller antecedents by using
membership queries to set more of the variables to F in the negative counterexamples that
we are given by any equivalence query. Thus, given a negative counterexample x we set some
variable which is currently T in x to F and ask whether the result satisfies $H_*$. If it does not,
then the result still violates some clause of $H_*$, so we leave the variable set to F; otherwise we
set the variable back to T. We repeat this process until we can set no more variables to F.
This process can be done quickly and we are left with a minimal negative example. However,
a second scenario will disclose the problem with this approach.

Suppose $H_*$ is defined over variable set $V = \{a, b, c, d\}$ and in fact $H_*$ is

$(b \wedge c \Rightarrow d) \wedge (b \Rightarrow a)$

Further suppose that we minimize negative examples by trying to set the variables to F in
alphabetical order. Let TTTF (that is, the variables a, b, and c set to T and the variable d set
to F) be the first negative counterexample. We minimize this to FTFF and add clauses(FTFF)
to our hypothesis, so H becomes

$(b \Rightarrow a) \wedge (b \Rightarrow c) \wedge (b \Rightarrow d) \wedge (b \Rightarrow F)$

Next we see the positive counterexample TTFF which simply reduces H to

$(b \Rightarrow a)$

Thus we have indeed found a clause of $H_*$. Next we see the negative counterexample TTTF.
But this is the same example that we saw at the beginning, and so our algorithm will never
terminate; it is forced to find the same clause repeatedly. The difficulty lies in the fact that even
though we were given an example that violated the first clause of $H_*$, our minimization produced
an example that violated the second clause of $H_*$, and this fact led to non-termination. (One
might object that it is foolhardy to decide a priori the order in which we will try to set the
variables of our negative examples to F; rather we should dynamically decide the order in which
to minimize a new negative counterexample. However, it appears to be a difficult problem to
design a polynomial-time algorithm that is guaranteed to minimize a negative example and
simultaneously avoid rediscovering any of the previously found "good" clauses even if it is known
which clauses in the current hypothesis are genuinely "good".)
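
The minimization step of this second scenario can be replayed in code. The following sketch assumes an example is stored as its set of true variables and a clause as an (antecedent, consequent) pair; `minimize` and `member` are hypothetical names, and `member` stands in for the membership oracle by evaluating the target directly.

```python
# Target of the second scenario: (b AND c => d) AND (b => a).
TARGET = [(frozenset("bc"), "d"), (frozenset("b"), "a")]


def violates(x, clause):
    """x violates (antecedent, consequent) if it covers it and falsifies the consequent."""
    antecedent, consequent = clause
    return antecedent <= x and consequent not in x


def member(x) -> bool:
    """Membership oracle: True exactly when x satisfies the target."""
    return not any(violates(x, c) for c in TARGET)


def minimize(x, member):
    """Greedily set true variables to F in alphabetical order, keeping x negative."""
    current = set(x)
    for v in sorted(x):
        trial = frozenset(current - {v})
        if not member(trial):       # still a negative example: keep v false
            current = set(trial)
    return frozenset(current)


# TTTF over (a, b, c, d): a, b, c are true and d is false.
minimized = minimize(frozenset("abc"), member)
```

The minimized example violates only the second clause of the target, although the starting example was produced as a witness for the first, which is the source of the rediscovery problem described above.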

As demonstrated by the first scenario, we must reduce the number of variables set to T 
in the negative counterexamples we are given, but as seen in the second scenario we cannot 
do this in the obvious way. A data-driven approach solves the dilemma. A new negative
example is used to attempt to "refine" previously obtained negative examples by intersection
(bitwise conjunction) - each such intersection, if it actually contains fewer true variables than
the previously obtained negative example, is then tested to see whether it is negative (using a
membership query). If so, it is a candidate to refine the previously obtained negative example.

The algorithm maintains a sequence S of negative examples. Each new negative counterexample
either is used to refine one element of S, or is added to the end of S. In order to learn
all of the clauses of $H_*$, we would like the clauses induced by the (negative) examples in S
to approximate distinct clauses of $H_*$. This will happen if the examples in S violate distinct
clauses of $H_*$. Overzealous refinement may result in several examples in S violating the same
clause of $H_*$. To avoid this, whenever a new negative counterexample could be used to refine
several examples in the sequence S, only the leftmost among these is refined.

By collecting these ideas, the learning algorithm (Figure 2.1) can be described intuitively in 
the following way. The sequence S of negative examples that the algorithm maintains is used 
to generate new hypotheses. Each negative example in the sequence can be explained by $O(n)$
different Horn clauses, and each of these possible explanations is placed in the hypothesis. Any 
clause in the hypothesis which is not logically entailed by the target will be exposed eventually by 
a positive counterexample. When this positive counterexample appears, the algorithm removes 
any clause that this example violates from the hypothesis. On the other hand, the hypothesis 
may also contain some clauses which are insufficiently strong. These clauses will be exposed 
eventually by a negative counterexample. When this negative counterexample occurs, the 
algorithm refines the first element of S it can (using an intersection and a membership query)
or it appends the new negative example to the end of S. In either case, after modifying S
the algorithm generates a new hypothesis from S. In Section 2.4 it is proved that this process 
produces in polynomial time a hypothesis which is logically equivalent to the target. 

A simulated run of the learning algorithm will now be given. Suppose $H_*$ is defined over
the variable set $V = \{a, b, c, d\}$ and in fact $H_*$ is

$H_* : (a \wedge c \Rightarrow d) \wedge (a \wedge b \Rightarrow c)$

Initially we set S to be the empty sequence and H to be the null hypothesis

S : [ ]

H : $\emptyset$

Suppose the first counterexample to our equivalence query for H is the negative example TTTF.
There are no elements of S that we can attempt to refine with this negative example, so we
simply append it to the end of the sequence, and since S changed we generate a new hypothesis
H by conjoining all of the clauses(s) for all $s \in S$. Thus,

S : [TTTF]

H : $(a \wedge b \wedge c \Rightarrow d) \wedge (a \wedge b \wedge c \Rightarrow F)$

Now suppose the next counterexample to our equivalence query for H is the positive example
TTTT. This eliminates an extraneous clause from H but does not change S, so we do not
generate a new H from S. Thus we now have

S : [TTTF]

H : $(a \wedge b \wedge c \Rightarrow d)$

 1  Set S to be the empty sequence   /* $s_i$ denotes the i-th element of S. */
 2  Set H to be the empty hypothesis
 3  UNTIL equivalent(H) returns "yes" DO
 4     /* main loop */
 5     Let x be the counterexample returned by the equivalence query
 6     IF x violates at least one clause of H
 7     THEN /* x is a positive example */
 8        remove from H every clause that x violates
 9     ELSE /* x is a negative example */
10     BEGIN
11        FOR each $s_i$ in S such that $true(s_i \cap x)$ is properly contained in $true(s_i)$
12        BEGIN
13           query member($s_i \cap x$)
14        END
15        IF any of these queries is answered "no"
16        THEN let i be the least number such that member($s_i \cap x$) was answered "no"
17             refine $s_i$ by replacing $s_i$ with $s_i \cap x$
18        ELSE add x as the last element in the sequence S
19        ENDIF
20        Set H to be $\bigwedge_{s \in S} clauses(s)$, where $clauses(s) = \{ (\bigwedge_{v \in true(s)} v) \Rightarrow z : z \in false(s) \}$
21     END
22     ENDIF
23  END /* main loop */
24  Return H

Figure 2.1: Algorithm for Learning Horn Sentences.

For clarity, we will assume for the remainder of this simulated run that we are able to discard
immediately any extraneous clauses by positive counterexamples returned from equivalence
queries for H, and so we will only show the effect of receiving a (necessarily) negative
counterexample. Suppose the next negative counterexample is TTFT. We can intersect this with
the (first) element of S and get a negative example with strictly fewer variables set to T than
the (first) element of S had. We do so, replacing this element of S with the result of the
intersection. Then, because S changed, we generate a new hypothesis H from S. Again assuming
that all extraneous clauses were subsequently eliminated, we have

S : [TTFF] 

H : $(a \wedge b \Rightarrow c)$

Suppose our next negative counterexample is TTTF. We cannot intersect this with any element
of S and get a negative example with strictly fewer variables set to T than that element of S
currently has, so we add our new negative counterexample to the end of the sequence S. This
changes S, so we regenerate H by conjoining all of the clauses(s) for all $s \in S$. This eventually
leaves us with

S : [TTFF, TTTF]

H : $(a \wedge b \Rightarrow c) \wedge (a \wedge b \wedge c \Rightarrow d)$.

Now suppose our final negative counterexample is TFTF. We cannot refine the first element
of S with this negative example, but we can refine the second element of S, so we replace the
second element of S by the intersection of that element and our negative counterexample. This
in turn mandates that we regenerate H from S, which eventually leaves us with

S : [TTFF, TFTF]

H : $(a \wedge b \Rightarrow c) \wedge (a \wedge c \Rightarrow d)$

Our final equivalence query for H tells us that we have learned $H_*$.

Note that the first and fourth counterexamples (both negative) were the same, namely TTTF; thus
the current hypothesis held by the algorithm is not necessarily consistent with the examples
seen so far.
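
The algorithm of Figure 2.1 can be sketched as a short program. This is a hedged illustration rather than the implementation analyzed in this chapter: both oracles are simulated by brute force over all assignments (feasible only for a handful of variables), an example is stored as its set of true variables, a clause as an (antecedent, consequent) pair, and membership queries are issued one at a time until the first "no" instead of all at once as in lines 11-14. All names are hypothetical.

```python
from itertools import product

VARS = ("a", "b", "c", "d")
FALSE = "F"  # the logical constant F; it is never a member of true(x)


def violates(x, clause):
    antecedent, consequent = clause
    return antecedent <= x and consequent not in x


def satisfies(x, sentence):
    return not any(violates(x, c) for c in sentence)


def clauses(s):
    """All candidate explanations of the negative example s (line 20)."""
    return {(s, z) for z in [v for v in VARS if v not in s] + [FALSE]}


def all_examples():
    for bits in product([True, False], repeat=len(VARS)):
        yield frozenset(v for v, b in zip(VARS, bits) if b)


def learn_horn(target):
    def member(x):                       # simulated membership oracle
        return satisfies(x, target)

    def equivalent(h):                   # simulated equivalence oracle
        for x in all_examples():
            if satisfies(x, h) != satisfies(x, target):
                return x                 # a counterexample
        return None                      # "yes"

    S, H = [], set()
    while (x := equivalent(H)) is not None:
        if not satisfies(x, H):          # x is a positive example (line 8)
            H = {c for c in H if not violates(x, c)}
        else:                            # x is a negative example
            for i, s in enumerate(S):
                if (s & x) < s and not member(s & x):
                    S[i] = s & x         # refine the leftmost candidate only
                    break
            else:
                S.append(x)              # line 18
            H = set().union(*(clauses(s) for s in S))
    return H, S


# Target of the simulated run: (a AND c => d) AND (a AND b => c).
TARGET = {(frozenset("ac"), "d"), (frozenset("ab"), "c")}
H, S = learn_horn(TARGET)
```

The counterexamples returned by this brute-force oracle need not arrive in the same order as in the simulated run, so intermediate hypotheses may differ; only the final hypothesis is guaranteed to be equivalent to the target.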

2.4 Correctness and Running Time

We prove that the algorithm of Figure 2.1 correctly terminates in time $\tilde{O}(m^3 n^4)$, making
$O(m^2 n^2)$ equivalence queries and $O(m^2 n)$ membership queries. In the next section, a more
efficient algorithm improves these bounds to $\tilde{O}(m^2 n^2)$, $O(mn)$, and $O(m^2 n)$, respectively.

First observe that the algorithm terminates only if the hypothesis and the target Horn
sentence $H_*$ are equivalent. Therefore, if the algorithm terminates, it is correct. To show
termination in polynomial time we first prove a couple of technical lemmas.

Lemma 17 For each execution of the main loop of line 3, the following holds. Suppose that in
step 5 of the algorithm a negative example x is obtained such that for some clause C of $H_*$ and
for some $s_i \in S$, x violates C and $s_i$ covers C. Then there is some $j \le i$ such that in step 17
the algorithm will refine $s_j$ by replacing $s_j$ with $s_j \cap x$.

Proof: The proof is by induction on the number of iterations $k$ of the main loop of line 3. If
$k = 1$, then the lemma is vacuously true, since the sequence $S$ is empty upon execution of step
5. Assume inductively that the lemma holds for iterations $1, 2, \ldots, k - 1$ of the main loop, and
assume that during the $k$-th execution of the loop, at step 5 a negative example $x$ is obtained
such that for some clause $C$ of $H_*$ and for some $s_i \in S$, $x$ violates $C$ and $s_i$ covers $C$. Clearly,
if in step 17 of the $k$-th iteration, the algorithm refines some $s_j$ where $j < i$, then we are
done. Suppose that this does not happen. Now by Lemma 13, we know that $s_i \cap x$ is a negative
example. It only remains to be shown that $\mathit{true}(s_i \cap x)$ is properly contained in $\mathit{true}(s_i)$, for
then $s_i$ will be refined in step 17. Observe that each time the sequence $S$ is modified, step
20 of the algorithm discards the old hypothesis and constructs a new hypothesis $H$ from the
elements currently in $S$. Further observe that during each execution of the main loop of line
3, either $S$ is modified (lines 9-21), or else a clause is removed from $H$ (line 8). Let $j < k$ be
the last execution of the main loop of line 3 during which $S$ was modified. Then, during the
$j$-th iteration, line 20 was executed and $H$ was reconstructed from $S$. At this time a clause
$C' = \left(\bigwedge_{v \in \mathit{true}(s_i)} v\right) \Rightarrow \mathit{consequent}(C)$ was included in $H$, where $C$ is the clause that $x$ and
$s_i$ both violate. Now $C$ logically implies $C'$, so $C'$ could not have been removed in line 8 during
iterations $j+1, \ldots, k$ of the main loop. Since the equivalence query returns only examples in the
symmetric difference between the hypothesis $H$ and the target $H_*$, a negative example obtained
in line 5 satisfies every clause of $H$. By assumption, $x$ violates $C$; thus $\mathit{consequent}(C') \in \mathit{false}(x)$.
But now if $\mathit{true}(s_i) \subseteq \mathit{true}(x)$, then $x$ would violate $C'$, a contradiction. Therefore, $\mathit{true}(s_i \cap x)$
is properly contained in $\mathit{true}(s_i)$. Thus the algorithm will replace $s_i$ by $s_i \cap x$ in line 17. □

Lemma 18 Let $S$ be a sequence of elements constructed for the target $H_*$ by the algorithm.
Then

1. $\forall k\, \forall (i < k)\, \forall (C \in H_*)$: if $s_k$ violates $C$, then $s_i$ does not cover $C$.

2. $\forall k\, \forall (i \ne k)\, \forall (C \in H_*)$: if $s_k$ violates $C$, then $s_i$ does not violate $C$.

Proof: The proof is by induction. We will show that properties 1 and 2 are preserved under
any modifications the algorithm makes to the sequence $S$.

Initially the sequence is empty, so both properties hold vacuously. Now suppose that the
properties hold for some sequence, and suppose that the algorithm modifies the sequence in
response to seeing the negative example $x$.

If the algorithm appends $x$ to the sequence as, say, $s_t$, then suppose by way of contradiction
that property 1 fails to hold. Inductively, the only way that property 1 could now fail to hold is
if there is some $i < t$ such that $s_i$ covers some clause $C$ of $H_*$ that $s_t$ violates. But this means
$s_i \cap x$ violates $C$. This together with Lemma 17 contradicts the fact that the algorithm did not
replace $s_j$ by $s_j \cap x$ for some $j \le i$. Thus property 1 is preserved.

Now suppose by way of contradiction that property 2 fails to hold. Inductively, the only way
property 2 could now fail to hold is if there is some $i < t$ such that $s_i$ and $s_t$ both violate some
clause $C$ of $H_*$. Since $s_i$ violates $C$ it also covers $C$. Then, by Lemma 17, some $s_j$ with $j \le i$
would have been refined instead of $x = s_t$ being added to $S$, a contradiction. Thus property 2
is preserved.

Now suppose that instead of appending $x$ to the sequence, the algorithm replaces some
$s_k$ with $s_k \cap x$. Suppose by way of contradiction that property 1 fails to hold. There are two
possibilities: either there is some $i < k$ such that $s_i$ covers and $s_k \cap x$ violates some particular
clause $C$ of $H_*$, or there is some $i > k$ such that $s_k \cap x$ covers and $s_i$ violates some particular
clause $C$ of $H_*$. If the former case holds, then by Lemma 16 either $x$ violates $C$ or $s_k$ violates
$C$. If $x$ violates $C$ then (since $s_i$ covers $C$) by Lemma 17 there must be some $j \le i < k$ such
that $s_j$ was refined instead of $s_k$, a contradiction. On the other hand, if $s_k$ violates $C$, then the
fact that $s_i$ and $s_k$ both violate $C$ contradicts the inductive assumption that property 2 held
before the modification.

Now consider the latter possibility, namely that there is some $i > k$ such that $s_k \cap x$ covers
and $s_i$ violates some clause $C$ of $H_*$. Then by (the contrapositive of) Lemma 15, $s_k$ covers $C$.
Since $s_i$ violates $C$ and $i > k$, this contradicts the inductive assumption that property 1 held
before the modification. Thus, property 1 is preserved.

Finally, suppose that the algorithm replaces some $s_k$ with $s_k \cap x$ and suppose by way of
contradiction that property 2 no longer holds. If this is the case, then there is some $i \ne k$
such that $s_i$ and $s_k \cap x$ both violate some particular clause $C$ of $H_*$. By Lemma 15, $s_k$ covers
$C$. Further, by Lemma 16, at least one of $s_k$ and $x$ must violate $C$. If $s_k$ violates $C$, then the
inductive assumption that property 2 held before the modification is contradicted by the fact
that $s_i$ also violates $C$. On the other hand, suppose $x$ violates $C$. If $i > k$, then the fact that $s_k$
covers $C$ contradicts the inductive assumption that property 1 held before the modification. If
$i < k$, then Lemma 17 and the fact that $s_i$ violates (and hence covers) $C$ contradict the fact
that the algorithm did not replace $s_j$ by $s_j \cap x$ for some $j \le i$. Thus, property 2 is preserved. □

Corollary 19 At no time do two distinct elements in S violate the same clause of H*. 

Proof: This is property 2 of Lemma 18. □ 

Lemma 20 Every element of $S$ violates at least one clause of $H_*$.

Proof: Each of the elements in $S$ is a negative example; thus, by Proposition 1, each of the
elements violates some clause of $H_*$. □

Lemma 21 If $H_*$ has $m$ clauses, then at no time are there more than $m$ elements in the
sequence $S$.

Proof: This follows immediately from the fact that each of the elements in $S$ violates some
clause of $H_*$ but no two elements violate the same clause of $H_*$. □

Finally, we have our theorem. 

Theorem 22 A Horn sentence consisting of $m$ clauses over $n$ variables can be exactly learned
in time $O(m^3 n^4)$ using $O(m^2 n^2)$ equivalence queries and $O(m^2 n)$ membership queries.

Proof: The only changes to the sequence $S$ during any run of the algorithm involve either
appending a new element to $S$, or refining an existing element. Thus $|S|$ cannot decrease
during any execution of the main loop of the algorithm. But Lemma 21 shows that there are at
most $m$ elements of $S$ at any time. Thus line 18 is executed at most $m$ times. Now observe that
whenever any element $s_i$ of the sequence $S$ is refined (line 17), the resulting new $i$-th element is
$s_i \cap x$, which, by line 11, must contain strictly fewer variables assigned the value "true" than $s_i$.
This can happen at most $n$ times for each element of $S$. Thus line 17 is executed at most $nm$
times. Whenever the ELSE clause at line 9 is executed, either line 17 or 18 must be executed.
It follows that lines 9-21 are executed at most $nm + m = (n + 1)m$ times. Since each such
execution makes at most one membership query per element of $S$, and $|S| \le m$, this bounds
the total number of membership queries made by $(n + 1)m^2$.

Next observe that for any element $s$ of $S$, the cardinality of $\mathit{false}(s)$ is at most $n + 1$ (recalling
that $\mathrm{F} \in \mathit{false}(s)$). Thus the cardinality of $\mathit{clauses}(s)$ is at most $n + 1$. Therefore, the number
of clauses in any hypothesis $H$ constructed in line 15 is at most $(n + 1)m$. Now, since each
positive counterexample obtained in line 5 necessarily causes at least one clause to be removed
from $H$ by line 8, the equivalence query can produce at most $(n + 1)m$ positive counterexamples
between modifications to $S$. Therefore, line 8 is executed at most $(n + 1)^2 m^2$ times. Since each
execution of line 3 that does not result in termination causes execution of line 8 or lines 9-21,
the total number of executions of line 3 (and hence the total number of equivalence queries
made) is at most $(n + 1)^2 m^2 + (n + 1)m + 1$.

To complete the proof we need only show that the time needed for each execution of the
main loop is $O(n^2 m)$. Using the facts (above) that at any time during the execution of the
algorithm $|S| \le m$ and $|H| \le (n + 1)m$, and that each element of $H$ consists of at most $n + 1$
variables (antecedent + consequent), it is easily verified that the time needed to execute either
of steps 8 and 20 is $O(n^2 m)$, and that these steps dominate the time to execute one iteration
of the main loop. □
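
The counting in this proof is simple enough to tabulate. The following sketch only restates the worst-case expressions derived above for given $m$ and $n$; the function name `horn_bounds` and the dictionary keys are hypothetical labels, not part of the original analysis.

```python
def horn_bounds(m, n):
    """Worst-case counts from the proof of Theorem 22 (m clauses, n variables)."""
    s_modifications = (n + 1) * m                # executions of lines 9-21
    membership_queries = s_modifications * m     # at most m queries per modification
    positive_counterexamples = (n + 1) ** 2 * m ** 2   # executions of line 8
    equivalence_queries = positive_counterexamples + s_modifications + 1
    return {
        "s_modifications": s_modifications,
        "membership_queries": membership_queries,
        "positive_counterexamples": positive_counterexamples,
        "equivalence_queries": equivalence_queries,
    }

print(horn_bounds(2, 4))
```

For the two-clause, four-variable target of the running example this gives at most 10 modifications of $S$, 20 membership queries, and 111 equivalence queries; the actual run above used far fewer.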

2.5 Improvements to the Algorithm

We describe a more efficient version of the learning algorithm for Horn sentences. There is
a natural shorthand notation for propositional Horn sentences obtained by gathering up all
the clauses with the same antecedent set and conjoining the consequents. The conjunction of
several clauses $C_1, \ldots, C_k$ with the same antecedent will be represented as a meta-clause whose
antecedent is the common antecedent of the clauses and whose consequent is the conjunction
of $\mathit{consequent}(C_i)$ for $i = 1, \ldots, k$. For example, the meta-clause

$(a \wedge b \wedge d \Rightarrow c \wedge e)$

is logically equivalent to, and will be used to represent, the conjunction

$(a \wedge b \wedge d \Rightarrow c) \wedge (a \wedge b \wedge d \Rightarrow e)$.

The new version of the algorithm maintains the current hypothesis as a sequence of meta- 
clauses, one meta-clause corresponding to each negative example in the sequence S in the 
previous version. We assume that this representation is used both by the algorithm and for 
the equivalence queries. (If the equivalence queries require that the representation be strictly a 
conjunction of Horn clauses, further (straightforward) optimizations must be made to achieve 
the time bounds below.)

In addition to this shorthand representation, we make use of the observation that once a
positive counterexample eliminates a clause, it eliminates any clause with a refined antecedent.
For example, the positive counterexample TTTF eliminates the clause $(a \wedge b \wedge c \Rightarrow d)$, and also
refinements like $(a \wedge b \Rightarrow d)$ and $(b \wedge c \Rightarrow d)$. Thus, when we refine the antecedent of a meta-clause,
we do not need to re-introduce possible consequents that have been eliminated.

The effects of counterexamples on meta-clauses can be exemplified as follows, starting with
the meta-clause

$(a \wedge b \wedge c \Rightarrow d \wedge e \wedge f)$.

A positive counterexample causes items to be struck from the consequent; for example, a positive
counterexample TTTTFT would result in the meta-clause

$(a \wedge b \wedge c \Rightarrow d \wedge f)$.

A subsequent negative counterexample that refines the corresponding negative example moves
variable(s) from the antecedent to the consequent. For example, the negative example TTFFFF
then results in

$(a \wedge b \Rightarrow c \wedge d \wedge f)$.

Using suitable data structures, this means that the total processing time spent modifying the
hypothesis is $O(mn)$, since each variable can appear in the antecedent, be moved to the
consequent, and be deleted, from each meta-clause. A more formal treatment follows.

We use the partial ordering $\le$ on assignments (with F below T) defined by $x \le y$ if and only if $x_i \le y_i$ for
$i = 1, \ldots, n$. If $C$ is a meta-clause, let $\mathit{negex}(C)$ denote the example that assigns T to all
the variables in the antecedent of $C$ and F to all the other variables. Then $\mathit{negex}(C)$ is the
minimum example in the ordering by $\le$ that violates $C$. In the new version of the algorithm,
$H$ is the conjunction of a sequence of meta-clauses $C_i$ such that $\mathit{negex}(C_i) = s_i$.
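
As a small sanity check of the claim that $\mathit{negex}(C)$ is the minimum violating example, the sketch below enumerates every assignment that violates one meta-clause and compares each against $\mathit{negex}(C)$. It assumes assignments are T/F strings over five variables and a consequent made of variables only (the constant F is not handled here); the helper names are hypothetical.

```python
from itertools import product

VARS = "abcde"

def leq(x, y):
    """Componentwise order on T/F strings; 'F' <= 'T' also holds as characters."""
    return all(xi <= yi for xi, yi in zip(x, y))

def negex(antecedent):
    """Assign T exactly to the antecedent variables."""
    return "".join("T" if v in antecedent else "F" for v in VARS)

def violates(x, antecedent, consequents):
    value = dict(zip(VARS, x))
    return (all(value[v] == "T" for v in antecedent)
            and any(value[c] == "F" for c in consequents))

antecedent, consequents = "ac", "e"          # the meta-clause (a & c => e)
bottom = negex(antecedent)
violators = ["".join(bits) for bits in product("FT", repeat=len(VARS))
             if violates("".join(bits), antecedent, consequents)]
print(bottom, bottom in violators, all(leq(bottom, x) for x in violators))
```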

We define three meta-clause operations: generating a new meta-clause from a negative
example ($\mathit{new}(x)$), reducing a meta-clause with a positive counterexample ($\mathit{reduce}(C, x)$), and
strengthening a meta-clause with a negative example ($\mathit{refine}(C, x)$).

Given a negative example $x$, define $\mathit{new}(x)$ to be the meta-clause whose antecedent is
precisely the set of variables assigned T by $x$ and whose consequent is F. Note that $\mathit{negex}(\mathit{new}(x)) = x$.
For example,

$\mathit{new}(\mathrm{TFTFT}) = (a \wedge c \wedge e \Rightarrow \mathrm{F})$.

This operation is used to construct an initial meta-clause from a new negative example. It
replaces the operation $\mathit{clauses}(x)$. The consequent F is introduced first because the Horn clause
with antecedent set $A$ and consequent F logically implies every other Horn clause with
antecedent set $A$. The other possible consequents are only introduced if and when the consequent
F is eliminated by some positive counterexample.

Given a meta-clause $C$ and an example $x > \mathit{negex}(C)$ for which $C$ is false (intuitively, $x$ is a
positive counterexample) we define $\mathit{reduce}(C, x)$ to be a meta-clause with the same antecedent
as $C$ and consequent defined as follows.

1. If the consequent of $C$ is the logical constant F, then the consequent of $\mathit{reduce}(C, x)$ is the
conjunction of those variables $v_i$ such that $v_i$ is assigned F in $\mathit{negex}(C)$ and $v_i$ is assigned
T in $x$.

2. If the consequent of $C$ is not the logical constant F, then the consequent of $\mathit{reduce}(C, x)$
is the conjunction of those variables in the consequent of $C$ that are assigned T in $x$.

For example,

$\mathit{reduce}((a \wedge b \Rightarrow \mathrm{F}), \mathrm{TTTFT}) = (a \wedge b \Rightarrow c \wedge e)$.

Also,

$\mathit{reduce}((a \wedge b \Rightarrow c \wedge d), \mathrm{TTTFT}) = (a \wedge b \Rightarrow c)$.

Given a meta-clause $C$ and an example $x < \mathit{negex}(C)$ (intuitively, $x$ is a negative
counterexample) we define $\mathit{refine}(C, x)$ to be a meta-clause whose antecedent is all the variables assigned
T by $x$, and whose consequent is defined as follows.

1. If the consequent of $C$ is the logical constant F, then the consequent of $\mathit{refine}(C, x)$ is the
logical constant F.

2. If the consequent of $C$ is not the logical constant F, then the consequent of $\mathit{refine}(C, x)$ is
the conjunction of all the variables in the consequent of $C$ and all the variables assigned
F by $x$ and T by $\mathit{negex}(C)$.

For example,

$\mathit{refine}((a \wedge b \wedge c \Rightarrow \mathrm{F}), \mathrm{TFTFF}) = (a \wedge c \Rightarrow \mathrm{F})$.

Also,

$\mathit{refine}((a \wedge b \Rightarrow c \wedge e), \mathrm{TFFFF}) = (a \Rightarrow b \wedge c \wedge e)$.

Note that the possible consequent $d$ is not re-introduced here, having been (presumably)
eliminated by a previous positive example.
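
The three operations are easy to express over sets of variables. The sketch below is one possible encoding, not the author's implementation: a meta-clause is assumed to be a pair (antecedent, consequent), where the consequent is either the string "F" or a frozenset of variables, and examples are T/F strings over a, ..., e. It reproduces the five worked examples above.

```python
VARS = "abcde"
F = "F"  # the logical constant F used as a consequent

def true_vars(x):
    """Variables assigned T by the example x (a string of T/F)."""
    return frozenset(v for v, bit in zip(VARS, x) if bit == "T")

def new(x):
    return (true_vars(x), F)

def reduce(C, x):
    antecedent, consequent = C
    t = true_vars(x)
    # Case 1: variables F in negex(C) and T in x.  Case 2: keep consequents true in x.
    return (antecedent, t - antecedent if consequent == F else consequent & t)

def refine(C, x):
    antecedent, consequent = C
    t = true_vars(x)
    # Variables dropped from the antecedent move to a non-F consequent.
    return (t, F if consequent == F else consequent | (antecedent - t))

print(new("TFTFT"))
print(reduce((frozenset("ab"), F), "TTTFT"))
print(reduce((frozenset("ab"), frozenset("cd")), "TTTFT"))
print(refine((frozenset("abc"), F), "TFTFF"))
print(refine((frozenset("ab"), frozenset("ce")), "TFFFF"))
```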

The new version of the algorithm, shown in Figure 2.2, has the same line numbers as the
previous version, for ease of comparison. As for the previous version, if the algorithm halts
then its output is correct, so we need only give bounds on its running time.

Since throughout the new version $\mathit{negex}(C_i)$ is equal to $s_i$ in the old version, the same
argument shows that if $m$ is the number of clauses of the target $H_*$, then there are at most $m$
meta-clauses in $H$ at any time. Note that no meta-clause is ever deleted from $H$.

Consider the career of a particular meta-clause $C_i$. $C_i$ is initially created in response to a
negative counterexample. When $C_i$ is first created, its antecedent consists of a conjunction of
variables, and its consequent is the logical constant F.

The antecedent of $C_i$ then only changes in response to negative counterexamples, and the
change must be to delete one or more variables from the antecedent. Thus, there can be at most
$n$ such changes. Since every negative counterexample must either cause the creation of a new
meta-clause or refine an existing one, there can be at most $m(n + 1)$ negative counterexamples.



NewHORN

 1  /* $H$ is a conjunction of meta-clauses $C_i$ */
 2  Set $H$ to be the empty hypothesis
 3  UNTIL equivalent($H$) returns "yes" DO
 4  BEGIN /* main loop */
 5    Let $x$ be the counterexample returned by the equivalence query
 6    IF $x$ violates at least one meta-clause of $H$
 7    THEN /* $x$ is a positive example */
 8      replace $C_i$ by $\mathit{reduce}(C_i, x)$ for every $C_i$ that $x$ violates
 9    ELSE /* $x$ is a negative example */
10    BEGIN
11      FOR each $C_i$ in $H$ such that $(\mathit{negex}(C_i) \cap x) < \mathit{negex}(C_i)$
12      BEGIN
13        query member($\mathit{negex}(C_i) \cap x$)
14      END
15      IF any of these queries is answered "no"
16      THEN let $i$ be the least number such that member($\mathit{negex}(C_i) \cap x$) is answered "no"
17        replace $C_i$ by $\mathit{refine}(C_i, \mathit{negex}(C_i) \cap x)$
18      ELSE add $\mathit{new}(x)$ as the last meta-clause in $H$
19      ENDIF
20      /* $H$ is already updated */
21    END
22    ENDIF
23  END /* main loop */
24  Return $H$

Figure 2.2: New Version of Algorithm for Learning Horn Sentences.
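
To make the control flow of Figure 2.2 concrete, here is a minimal runnable sketch. Several things in it are assumptions for illustration and are not in the original: examples are frozensets of true variables, the teacher is a brute-force oracle over all $2^n$ assignments (so it is exponential and only suitable for toy targets), and membership queries are asked lazily in order, stopping at the first "no", which selects the same least $i$ as lines 11-16.

```python
from itertools import product

VARS = "abcd"
FALSE = "F"  # the logical constant F used as a consequent

def assignments():
    for bits in product([False, True], repeat=len(VARS)):
        yield frozenset(v for v, bit in zip(VARS, bits) if bit)

def violates(x, C):
    antecedent, consequent = C
    return antecedent <= x and (consequent == FALSE or not consequent <= x)

def make_teacher(target):
    """Brute-force teacher; target is a list of (antecedent, consequent) clauses."""
    def member(x):
        return all(not ante <= x or cons in x for ante, cons in target)
    def equivalent(H):
        for x in assignments():
            if member(x) == any(violates(x, C) for C in H):
                return x      # hypothesis and target disagree on x
        return None           # plays the role of the answer "yes"
    return member, equivalent

def new_horn(member, equivalent):
    H = []                    # meta-clauses stored as [antecedent, consequent]
    while True:
        x = equivalent(H)
        if x is None:
            return H
        violated = [C for C in H if violates(x, C)]
        if violated:          # line 8: x is a positive example, reduce
            for C in violated:
                C[1] = x - C[0] if C[1] == FALSE else C[1] & x
        else:                 # lines 9-21: x is a negative example
            for C in H:
                y = C[0] & x
                if y < C[0] and not member(y):   # least i answered "no"
                    if C[1] != FALSE:            # refine: dropped vars become consequents
                        C[1] = C[1] | (C[0] - y)
                    C[0] = y
                    break
            else:
                H.append([x, FALSE])             # line 18: new(x)

target = [(frozenset("ab"), "c"), (frozenset("ac"), "d")]
member, equivalent = make_teacher(target)
H = new_horn(member, equivalent)
print(H)
```

On this target, the running example $(a \wedge b \Rightarrow c) \wedge (a \wedge c \Rightarrow d)$, the learner halts with two meta-clauses, one per target clause, as the bound on the number of meta-clauses requires. The counterexamples it sees depend on the enumeration order of this particular toy teacher.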
The consequent of $C_i$ may change in response to negative or positive counterexamples. The
first positive counterexample to $C_i$ changes its consequent from the logical constant F to the
conjunction of a subset of the variables not in the antecedent of $C_i$. Negative counterexamples
before the first positive counterexample do not change the consequent of $C_i$; it remains F.
Subsequent negative counterexamples to $C_i$ may move one or more variables from the antecedent
to the consequent of $C_i$. Positive counterexamples to $C_i$, after the first, can only remove
variables from the consequent of $C_i$. Every variable can be deleted at most once from the
consequent of $C_i$, and, if the consequent is not the logical constant F, at least one variable
must remain in it. Thus, $C_i$ can be changed by at most $n$ positive counterexamples. Since
every positive counterexample must change one or more meta-clauses, there can be at most $mn$
positive counterexamples.

Since each negative counterexample can cause at most $m$ membership queries, no more
than $m^2(n + 1)$ membership queries will be made. With a straightforward representation of
meta-clauses and assignments as lists of length $n$, this algorithm can be implemented to run in
time $O(m^2 n^2)$.

Our analysis of the improved algorithm establishes the following theorem. 

Theorem 23 A Horn sentence consisting of $m$ clauses over $n$ variables can be exactly learned
in time $O(m^2 n^2)$ using $O(mn)$ equivalence queries and $O(m^2 n)$ membership queries.

2.6 Compression 

It is interesting to note that because the membership and equivalence queries are answerable 
by a polynomially time bounded teacher, it is possible to apply the Horn learning algorithms 
to obtain a polynomial time algorithm that given an arbitrary Horn sentence produces an 
equivalent Horn sentence whose size is at most linearly larger than the smallest equivalent 
Horn sentence. 

Corollary 24 There exists a polynomial time algorithm that given any Horn sentence $H$ over
$n$ variables produces an equivalent Horn sentence $H'$ whose size is $O(n)$ times the size of the
smallest equivalent Horn sentence.

Proof: The learning algorithms presented here find a representation with the fewest distinct
antecedents: because each negative example violates some clause of the target, there are no
more negative examples in $S$ than there are clauses in the target, and a single antecedent is
constructed from each negative example in $S$. For each antecedent, every consequent supported
by the target is added; there are only $O(n)$ possible consequents for a given antecedent. Thus,
the size of the Horn sentence constructed by the learning algorithms is at most $O(n)$ times the size of
the smallest representation.

Now, since membership and equivalence queries are answerable in polynomial time, given
$H$ as input, we can construct a polynomial time algorithm to answer queries about $H$ posed
by the learning algorithms instead of requiring a teacher. The Horn sentence $H'$ produced by
the learning algorithm against this substitute teacher is sufficiently small. □
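
The proof relies on equivalence queries between Horn sentences being answerable in polynomial time but does not spell out how. One standard way, shown here as an assumption rather than as the author's construction, is forward chaining: a Horn sentence $P$ entails a clause $(A \Rightarrow c)$ exactly when chaining from $A$ under $P$ derives $c$ or the constant F, and otherwise the derived set of facts is itself a counterexample. The function names `closure` and `counterexample` are hypothetical.

```python
def closure(clauses, start):
    """Forward chaining: smallest superset of start closed under the clauses.
    Returns None if the logical constant F is derived."""
    facts = set(start)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in clauses:
            if antecedent <= facts and consequent not in facts:
                if consequent == "F":
                    return None
                facts.add(consequent)
                changed = True
    return frozenset(facts)

def counterexample(H1, H2):
    """Return a set of true variables on which H1 and H2 differ, or None if equivalent."""
    for P, Q in ((H1, H2), (H2, H1)):
        for antecedent, consequent in Q:
            model = closure(P, antecedent)
            if model is not None and consequent not in model:
                return model      # satisfies P but violates this clause of Q
    return None

target = [(frozenset("ab"), "c"), (frozenset("ac"), "d")]
redundant = target + [(frozenset("ab"), "d")]     # adds an implied clause
weaker = [(frozenset("ab"), "c")]
print(counterexample(target, redundant))   # None: the two are equivalent
print(counterexample(weaker, target))      # an assignment separating them
```

A membership query for an assignment is answered directly by evaluating every clause of $H$ on it, so both kinds of query have a polynomial-time substitute teacher.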

2.7 Hardness Results 

If membership queries are not available, it is an open problem whether Horn sentences are PAC 
learnable or polynomial-time predictable (from random examples alone). By the reductions
of Kearns, Li, Pitt, and Valiant [65], PAC learnability of Horn sentences would imply PAC
learnability of general CNF and DNF sentences, and similarly for polynomial predictability. 

It is also an open problem whether general CNF or DNF formulas are learnable or
polynomially predictable when equivalence and membership queries are available. For any $k$, let
$k$-quasi-Horn be the class of CNF formulas where each clause contains at most $k$ unnegated
literals. Thus 1-quasi-Horn is just the class of Horn sentences, and is learnable using
equivalence and membership queries. Here, using a clever observation by Vijay Raghavan about
Pitt and Warmuth's requirements for prediction-preserving reductions [86], we show that if an
algorithm exists to learn the class of 2-quasi-Horn formulas using equivalence and membership
queries then the general class of CNF formulas (and DNF formulas) would be learnable by an
algorithm that uses membership and equivalence queries.

Corollary 25 If the class of 2-quasi-Horn formulas is learnable using membership and equiv- 
alence queries, then the class of CNF formulas is learnable using membership and equivalence 
queries. 

Proof: Let the class of CNF formulas be defined over the variable set $\{x_1, \ldots, x_n\}$. Introduce
new variables $\{y_1, \ldots, y_n\}$ such that the negation of $y_i$ will behave like $x_i$. Let $C_*$ be the
target CNF formula, and everywhere the variable $x_i$ appears unnegated replace it with the negation of
$y_i$. Notice that this produces a CNF formula (over the variables $\{x_i\}_{i=1}^{n} \cup \{y_i\}_{i=1}^{n}$) that has no
unnegated literals, and is thus Horn. Obtain a CNF formula $C'_*$ by conjoining to this formula
the $n$ clause pairs $(x_i \vee y_i) \wedge (\neg x_i \vee \neg y_i)$, which require that exactly one of $x_i$ and $y_i$ is set to T
in any example that satisfies $C'_*$. Note that $C'_*$ is a 2-quasi-Horn formula whose size is only
polynomially larger than $C_*$.

Now consider extending any example $e$ of $C_*$ (over $n$ variables) to an example $e'$ of $C'_*$ (over
$2n$ variables) by setting the value of $y_i$ to be the negation of the setting of $x_i$. The example
$e$ satisfies $C_*$ if and only if $e'$ satisfies $C'_*$. The objective here is that because such a
2-quasi-Horn $C'_*$ exists for any $C_*$, we can employ a 2-quasi-Horn learning algorithm $A$ to identify $C_*$
by conditioning all examples as above and pretending that the extant $C'_*$ is the actual target.
Clearly, this conditioning can be accomplished without knowing $C_*$.

Now, when $A$ poses a membership query, if every $x_i$ is set oppositely of its corresponding
$y_i$, then make a membership query about $C_*$ using the settings of the $x_i$ and return that answer
to $A$. Clearly, the answer to these two membership queries must be the same. On the other
hand, if some $x_i$ and $y_i$ are set identically, answer the query "no"; this setting violates one of
the clauses of the hypothetical $C'_*$ that demands that the $x_i$ and $y_i$ be set oppositely. Thus,
answering membership queries correctly for $C'_*$ does not require knowledge of $C_*$.

When $A$ poses an equivalence query on the hypothesis $H'$, we know that, if correct, the
variables $x_i$ and $y_i$ must behave as negations of one another in the hypothetical $C'_*$. Thus, obtain
$H$ by replacing every variable $y_i$ by the literal $\neg x_i$ and make an equivalence query against $C_*$
about $H$. If the equivalence query is answered "yes" we are done because $H$ is a CNF over
$\{x_i\}_{i=1}^{n}$ that is logically equivalent to $C_*$. If the equivalence query is answered "no" and an
example $e$ (over $n$ variables) is returned, then $e$ satisfies $C_*$ if and only if $e$ falsifies $H$. Extend
$e$ to an example $e'$ (over $2n$ variables) as described above. Because $H$ was obtained from $H'$
by replacing $y_i$ with $\neg x_i$, $e'$ satisfies $H'$ if and only if $e$ satisfies $H$, so that $e$ satisfies $C_*$ if and
only if $e'$ falsifies $H'$. But $e$ satisfies $C_*$ if and only if $e'$ satisfies the hypothetical $C'_*$, so $e'$
satisfies $C'_*$ if and only if $e'$ falsifies $H'$. That is to say, $e'$ is a counterexample to $H'$ for the
hypothetical target $C'_*$. Therefore, supplying $A$ with the counterexample $e'$ properly answers
the equivalence query on $H'$ "no". Thus, answering equivalence queries correctly for $C'_*$ does
not require knowledge of $C_*$.

It now follows that the above construction produces a learning algorithm for the class of
CNF formulas from membership and equivalence queries, provided that $A$ exists. □
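
The variable-doubling construction and the simulated membership oracle can be sketched in a few lines. The encoding below is an assumption made for illustration: a literal is a pair (variable name, is_positive), variables are named x1, ..., xn and y1, ..., yn, and the function names are hypothetical. The check at the end confirms, on one small non-Horn CNF, that $e$ satisfies $C_*$ exactly when the extended example $e'$ satisfies $C'_*$.

```python
from itertools import product

def to_quasi_horn(cnf, n):
    """Build C' from a CNF over x1..xn: replace each unnegated x_i by (not y_i),
    then add the clause pairs (x_i or y_i) and (not x_i or not y_i)."""
    out = [[("y" + v[1:], False) if positive else (v, False) for v, positive in clause]
           for clause in cnf]
    for i in range(1, n + 1):
        out.append([(f"x{i}", True), (f"y{i}", True)])
        out.append([(f"x{i}", False), (f"y{i}", False)])
    return out

def extend(e):
    """Extend an assignment to the x's by setting each y_i to the negation of x_i."""
    return {**e, **{"y" + v[1:]: not value for v, value in e.items()}}

def satisfies(formula, e):
    return all(any(e[v] == positive for v, positive in clause) for clause in formula)

def simulated_member(member_cnf, e2, n):
    """Answer a membership query about C' using only a membership oracle for C."""
    if any(e2[f"x{i}"] == e2[f"y{i}"] for i in range(1, n + 1)):
        return False      # some x_i and y_i agree, which violates a clause pair
    return member_cnf({v: value for v, value in e2.items() if v.startswith("x")})

# (x1 or x2 or x3) and (not x1 or x2): three unnegated literals, so not 2-quasi-Horn.
cnf = [[("x1", True), ("x2", True), ("x3", True)], [("x1", False), ("x2", True)]]
quasi = to_quasi_horn(cnf, 3)
for bits in product([False, True], repeat=3):
    e = dict(zip(["x1", "x2", "x3"], bits))
    assert satisfies(cnf, e) == satisfies(quasi, extend(e))
print(max(sum(positive for _, positive in clause) for clause in quasi))
```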

The following PAC related corollary, whose proof borrows ideas from the previous corollary, 
also holds. 

Corollary 26 If the class of 2-quasi-Horn formulas is PAC learnable using membership queries, 
then the class of CNF formulas is PAC learnable using membership queries. 

Proof Sketch: As in the previous corollary, a 2-quasi-Horn target (over twice as many 
variables) exists for a given CNF formula. The simulation now is less complicated than before. 
The randomly generated examples received are transformed into examples of the 2-quasi-Horn 
as before, the membership queries posed by the 2-quasi-Horn algorithm are answered as before, 
and there are no equivalence queries to answer. □ 

Finally, based on the Angluin and Kharitonov result [10], we immediately have the following 
corollary. 

Corollary 27 Assuming the existence of one-way functions, if the class of 2-quasi-Horn for- 
mulas is PAC learnable using membership queries, then the class of CNF formulas is PAC 
learnable from random examples alone (without membership queries). 

Proof: The Angluin and Kharitonov result says that under the assumption of the existence 
of one-way functions, membership queries do not help when trying to predict CNF in the PAC 
setting. □ 

2.8 Discussion 

An exact learning algorithm for propositional Horn sentences using membership and equivalence 
queries was presented. By the results of Angluin [6, 4] neither type of query alone is sufficient 
to allow exact learning in polynomial time. The algorithm may be used to obtain an algorithm 
for PAC learning or polynomial prediction [59, 86] of Horn sentences from randomly generated 
examples, provided that membership queries are also available to the algorithm. 

An interesting open problem is whether the algorithm here can be extended so as to handle
restricted types of universally quantified Horn sentences (see the papers of Valiant [99] and
Haussler [57] for related classes of formulas). Although the answer appears to be "no", this
class is of significant interest due to its similarity to the language Prolog, and its use in logic 
programming and expert system design. The pessimism arises from the dramatic change in 
semantics between propositional and first-order logics. Specifically, the obvious first-order ana- 
log of a propositional example is a potentially infinite table specifying truth assignments to 
every object in the Herbrand universe of the target formula. Infinite sized examples render 
"polynomial time" meaningless. Even so, in chapters 6 and 7, after discussing alternative types 
of examples, we will discuss polynomial time learning algorithms for two first-order classes.

Chapter 3

Propositional Horn Sentences and 
Membership by Logical Entailment 

To motivate the distinction between teachers providing standard variable assignment examples 
and teachers providing formulas labeled according to logical entailment, consider the following 
two learning tasks involving elephants. The unknown concept of an elephant might be expressed 
as a boolean formula of a particular form defined over variables which denote real-world
attributes. For example, the formula large $\wedge$ grey $\wedge$ has-trunk $\wedge$ eats-peanuts might represent
exactly the set of all elephants, to the extent that an assignment to the above boolean variables 
represents an elephant if and only if it satisfies the conjunction. 

A related, but perhaps philosophically different problem, is obtained by viewing the target 
as the description of the only world the learner has at his disposal to explore. Here, an un- 
known formula may be viewed as a theory about the world, and positive examples a will 
be statements which follow from the theory. For example, if the theory contained the impli- 
cations that "large mobile things can crush you," "elephants are large things," and "elephants 
are mobile," then a positive example would be the entailed sentence "elephants can crush you,"
a fact that one might wish to know if living near elephants. We seek to construct a theory 
about the world containing assertions such as "elephants are mobile" and "elephants are large
things" given entailed assertions such as "elephants can crush you". Such a problem is
central to a number of areas, including expert system design (construct a knowledge base from
example real-world facts) and logic programming (synthesize a Prolog program from example
input/output behavior).
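
The elephant example can be made concrete with a small entailment check. This is a sketch under stated assumptions: the theory is propositional, each implication is encoded as a pair (set of antecedent atoms, consequent atom), the atom names are invented for illustration, and entailment of a clause is decided by forward chaining from its antecedent. Only definite clauses are handled; the constant F is not.

```python
def entails(theory, clause):
    """Does the Horn theory entail the clause (antecedent => consequent)?
    Forward-chain from the antecedent and test whether the consequent is derived."""
    antecedent, goal = clause
    facts = set(antecedent)
    changed = True
    while changed and goal not in facts:
        changed = False
        for body, head in theory:
            if body <= facts and head not in facts:
                facts.add(head)
                changed = True
    return goal in facts

theory = [
    (frozenset({"large", "mobile"}), "can_crush_you"),   # large mobile things can crush you
    (frozenset({"elephant"}), "large"),                  # elephants are large things
    (frozenset({"elephant"}), "mobile"),                 # elephants are mobile
]

print(entails(theory, ({"elephant"}, "can_crush_you")))  # a positive (entailed) example
print(entails(theory, ({"large"}, "can_crush_you")))     # a negative (non-entailed) example
```

In the learning problem the direction is reversed: the learner sees labeled clauses such as the two queried here and must reconstruct a theory equivalent to the hidden one.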

Philosophically, the distinction between these two learning tasks can be summarized as 
follows. Learning from a standard teacher can be thought of as learning to classify. In this
setting, the unknown formula is a description of the world and the settings of the variables are
worlds which either do or do not satisfy the description; thus the learner is seeing a collection
of variable settings and is being asked to learn to classify those settings as "worlds" and "non- 
worlds". In contrast, learning from an entailed example teacher can be thought of as having 
just one world whose description is to be deduced from truths about the world. 

Though there have been numerous¹ interesting results in learning from standard teachers,
there has been relatively little work that investigates learning from entailed example teachers.²
We present a polynomial time algorithm for learning propositional Horn sentences using mem- 
bership and equivalence queries from an entailed example teacher using Horn clauses as ex- 
amples. Our interest in the entailed example teacher was sparked by work in the area of 
approximate entailment [37, 92, 63, 52]. We suggest that our algorithm might be useful in this 
setting, and leave open the general question of how techniques found in learning algorithms 
might be helpful in this area. 

The current algorithm is in fact an application of the algorithm presented in the previous 
chapter, but with a number of twists. We will present two different learning algorithms - one 
of which learns the class directly and one of which applies a prediction-preserving reduction to
the learning algorithm in the previous chapter. Although the main ideas in the direct algorithm 
and proof are the same as in the previous chapter, some tricks are needed to overcome a few 
difficulties. The result is, in fact, a learning algorithm for a different learning problem. As with 
the standard teacher, the entailment teacher algorithm learns propositional Horn sentences in 
time polynomial in the size of the sentence to be learned. 

Motivating our results, Angluin describes an algorithm for learning propositional Horn
sentences from equivalence and "derivation" queries [5]. A derivation query allows the learning
algorithm to propose a clause $C$ and, in response, to be told whether $C$ is "subsumed," "not
subsumed, but entailed," or "not entailed" by the Horn theory to be learned. Thus, our algorithm
strengthens Angluin's result by limiting the types of answers given in response to a query
("entailed" or "not entailed"). Note that queries answered "subsumed", which we do not allow,
give information about the syntactic structure, and not just the logical structure, of the theory
to be learned. Eliminating subsumption queries has the appeal that no particular target
representation needs to be assumed, as well might be the case when examples are provided from
nature. Angluin leaves open the question that this chapter closes.

¹ See proceedings from Workshop on Computational Learning Theory [61, 60, 48, 47, 100, 58].
² There has been some work in the area of inductive logic programming that considers learning with entailed
first-order atoms [82] rather than clauses.

The recent results in the area of concept learning from examples were catalyzed by the 
introduction of reasonable formal definitions of efficient concept learning [98, 3], and the devel- 
opment of techniques for constructing efficient learning algorithms and proving their correctness. 
In contrast, there has been relatively little work that applies these new definitions and tech- 
niques to the problem of learning from entailment (but see Section 3.2). In this chapter we 
describe a polynomial-time algorithm that can learn an unknown propositional Horn sentence 
T* by posing certain natural queries to a teacher. 

3.1 Introduction 

We consider the following learning protocol: some propositional Horn sentence T* is chosen
and is unknown to the learner; the learning algorithm is permitted membership and equivalence
queries; and the learning algorithm sees Horn clauses labeled according to entailment by T* as
examples. We prove the following:

Theorem 28 Let T* be any propositional Horn sentence over n variables. The algorithm
entailHORN (figure 3.3), using equivalence queries and membership queries (of entailed clauses)
answered with respect to the unknown formula T*, halts in time polynomial in n and the size 3
of T*, and outputs a Horn sentence H that is logically equivalent to T*.

More briefly, we'll say that entailHORN exactly learns the class of Horn sentences from 
entailment, using equivalence and membership queries. This learning model is, with one impor- 
tant exception, Angluin's standard, well-investigated model of exact learning from equivalence 
queries and membership queries [3]. The difference is that here the examples queried by the 

3 The size of a Horn sentence is the number of symbols needed to write the Horn sentence. 




learner and the counterexamples to the learner's hypotheses are not assignments to the vari- 
ables which may or may not satisfy T*, but rather are Horn clauses which may or may not be 
entailed by T*. We note that because Horn clause entailment by a Horn sentence is decidable, 
entailHORN can be converted to a PAC or on-line learning algorithm provided that membership 
queries remain available. 
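
The decidability remark above rests on the fact that entailment of a Horn clause by a Horn
sentence can be checked by forward chaining. The following is a minimal Python sketch of such a
check, not the thesis's own code. The clause representation (a pair of a frozenset antecedent and
a consequent, with `None` standing for the constant F) and the names `closure`, `entails`, and
`target` are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

# Assumed representation: (antecedent, consequent); a consequent of None stands for F.
Clause = Tuple[FrozenSet[str], Optional[str]]


def closure(sentence: Iterable[Clause], start: Iterable[str]) -> Tuple[Set[str], bool]:
    """Forward-chain from `start`; return (derived variables, whether F was derived)."""
    clauses = list(sentence)
    derived = set(start)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in clauses:
            if antecedent <= derived:
                if consequent is None:
                    return derived, True  # a clause "antecedent -> F" fires
                if consequent not in derived:
                    derived.add(consequent)
                    changed = True
    return derived, False


def entails(sentence: Iterable[Clause], clause: Clause) -> bool:
    """Decide whether the Horn sentence entails the Horn clause."""
    antecedent, consequent = clause
    derived, inconsistent = closure(sentence, antecedent)
    # If the antecedent is inconsistent with the sentence, every consequent follows.
    return inconsistent or consequent in derived


# Hypothetical target: (a -> b) and (b -> c) and (c and d -> F).
target: List[Clause] = [
    (frozenset({"a"}), "b"),
    (frozenset({"b"}), "c"),
    (frozenset({"c", "d"}), None),
]
```

With this target, `entails(target, (frozenset({"a"}), "c"))` asks whether the clause a -> c is
entailed; a membership query in the sense of this chapter could be simulated by such a call.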

3.1.1 Approximate Entailment 

Let T* be a known theory. In general, we would like to answer questions of the sort "Is $\alpha$
entailed by T*?" Depending on T* and $\alpha$, the general question can be undecidable, or at least
computationally intractable. Consequently, several researchers have grappled recently with the
notion of approximate entailment. Rather than answering questions of entailment about the
actual theory T* and query $\alpha$, Dalal and Etherington propose answering questions of entailment
about a variety of strengthenings and weakenings of T* and $\alpha$ ([37]). They note that soundness
or completeness or both can be lost under this protocol. Still, certain combinations of their 
strengthenings and weakenings do preserve either soundness or completeness and permit the 
question of entailment for queries to be answered efficiently. 

Kautz and Selman construct a protocol in which soundness and completeness are preserved 
even at the expense of categoricity (i.e., some questions of entailment might be answered "I 
don't know") ([92, 63]). Specifically, they consider the case where T* is a propositional CNF
theory and suggest approximating T* by choosing as an upper bound the unique strongest Horn
sentence entailed by T* and choosing as a lower bound a weakest Horn sentence which entails
T*. Any Horn clause entailed by the upper bound is also entailed by T*, and any Horn clause
not entailed by the lower bound is not entailed by T*. Further, since these bounds are Horn
sentences, questions of entailment are efficiently answerable for a large class of formulas. They
note that the upper bound, while unique, in some cases is exponentially larger than the actual
CNF theory T*. Similarly they note that the lower bound, while never much larger than T*, is
not necessarily unique. 

Greiner and Schuurmans also use the idea of upper bounding and lower bounding a known
CNF theory T* with Horn sentences ([52]). They, however, demand that the upper bound
be small. Rather than finding the unique Horn least upper bound (which may be large),
they attempt to find, via hill-climbing, a small Horn upper bound that performs optimally at
predicting questions about entailment among all Horn upper bounds that are "similar".

Current attempts to construct the Horn least upper bound have a "generate and test" 
flavor to them - Horn clauses are "generated", and are "tested" by determining (via resolution) 
whether or not the CNF theory entails the clause. If so, they are included in the Horn least 
upper bound. 
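
To make the "generate and test" idea concrete, here is a small Python sketch under stated
assumptions. It is not taken from the cited work: the entailment test uses brute-force
truth-table enumeration as a stand-in for resolution, and the CNF encoding (lists of
`(variable, polarity)` literals) and the names `cnf_entails` and `horn_upper_bound` are
hypothetical. It also counts how many candidate clauses are tested.

```python
from itertools import combinations, product


def cnf_satisfied(cnf, assignment):
    """A CNF is a list of clauses; each clause is a list of (variable, polarity) literals."""
    return all(any(assignment[v] == pol for v, pol in clause) for clause in cnf)


def horn_clause_satisfied(antecedent, consequent, assignment):
    """Horn clause antecedent -> consequent; a consequent of None stands for F."""
    if not all(assignment[v] for v in antecedent):
        return True
    return consequent is not None and assignment[consequent]


def cnf_entails(cnf, variables, antecedent, consequent):
    """Truth-table entailment test (exponential; a stand-in for resolution)."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if cnf_satisfied(cnf, assignment) and not horn_clause_satisfied(
            antecedent, consequent, assignment
        ):
            return False
    return True


def horn_upper_bound(cnf, variables):
    """Generate every non-trivial Horn clause, keep the entailed ones, count the tests."""
    kept, tested = [], 0
    for size in range(len(variables) + 1):
        for antecedent in combinations(variables, size):
            consequents = [v for v in variables if v not in antecedent] + [None]
            for consequent in consequents:
                tested += 1
                if cnf_entails(cnf, variables, antecedent, consequent):
                    kept.append((frozenset(antecedent), consequent))
    return kept, tested


# Hypothetical CNF theory: (a or b) and (not a or c) and (not b or c).
cnf = [[("a", True), ("b", True)], [("a", False), ("c", True)], [("b", False), ("c", True)]]
kept, tested = horn_upper_bound(cnf, ["a", "b", "c"])
```

In this toy case every kept clause is implied by the single clause "-> c", yet all candidate
clauses are tested, which illustrates why the number of tests need not track the size of the
bound.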

One problem which arises is that even in cases where the Horn least upper bound is small, 
there seems to be no guarantee that the number of clauses tested is small. 4 Hence, the number 
of applications of a possibly exponential time resolution procedure might be much larger than 
the size of the smallest representation of the Horn least upper bound. Ideally, we would like 
to derive a Horn least upper bound from a CNF theory in time polynomial in the size of the 
Horn least upper bound. Although we cannot do this, in Section 3.4 we show how algorithm 
entailHORN can be used to "lazily" construct a Horn least upper bound of a known CNF 
theory when queries about entailment are received in an on-line setting and periodically we 
are informed of some error in our entailment prediction. Further, the number of appeals to a 
potentially exponential-time resolution procedure is bounded by a polynomial in the size of the 
smallest representation of the Horn least upper bound, and the particular expression for the 
Horn least upper bound that is found is also at most polynomially larger than the best such 
expression. 



3.2 Related Work 

This work builds on the work of the previous chapter that constructs an algorithm for learning 
Horn sentences from satisfying assignments. The direct algorithm presented below is in fact an 
application of either algorithm from the previous chapter, but with a number of twists. The 
standard variable assignment examples of the previous chapter might be viewed as complete 
conjuncts (or fundamental products) that are labeled as entailing or not entailing the target. 
Here we do not care about such examples at all; our examples now are Horn clauses that are 
labeled as being entailed or not being entailed by the target. As with the algorithm of the 

4 It is an easy task to construct a small CNF theory that has a small Horn least upper bound but has an 
exponential number of entailed Horn clauses. 

previous chapter, the current algorithm learns propositional Horn sentences in time polynomial 
in the size of the sentence to be learned. 

On the other hand, we also present an algorithm for learning from entailment constructed 
using either algorithm from the previous chapter as a "black box" once a suitable transformation 
between the entailed example environment and the standard example environment has been 
constructed. The transformation involves constructing two polynomial time algorithms that use 
membership queries and equivalence queries to act as oracles for an algorithm of the previous 
chapter - one to answer the membership queries of the type asked by that algorithm and one to 
answer the equivalence queries of the type asked by that algorithm. The task accomplished by 
each oracle is essentially converting between examples of the type available (which are clauses 
entailed by exactly one of the target and the current hypothesis) and the type used by the 
algorithms of the previous chapter. 

As discussed above, Angluin's work with entailment and subsumption for propositional 
Horn sentences [5] is closely related to the work presented in this chapter. Other authors have 
looked at examples entailed by first-order formulas. For example, Page and Frisch look at atoms 
labeled according to whether they are entailed by a hidden first order formula ([82]). Their 
work examines more the effect of multiple occurrences of a variable in the hidden theory and 
less the effect of connectives in the hidden theory. Our work also uses entailment as the means 
of labeling examples. However, we use entailed examples to learn propositional formulas where 
the connectives play a significant role within the hidden formula as well as within the examples 
themselves. 

There are some less closely related first-order results from the field of inductive logic pro- 
gramming where the target is a Prolog program. Shapiro describes a system for learning Prolog 
programs in the limit (i.e., time is an unbounded resource) using atoms entailed by the program 
to be learned ([95]). Also, Dzeroski et al. describe an algorithm that learns k-clause,
determinate, function-free, first-order Horn clauses with bounded depth variables by transforming
the target into a propositional monotone k-term DNF formula ([41]).

3.3 The Algorithm 

We now present two algorithms that learn Horn sentences using membership and equivalence 
queries from an entailed example teacher using Horn clauses as examples. 5 

3.3.1 Learning by Reduction 

First, we present an algorithm for exactly learning Horn sentences by entailed example using 
membership and equivalence queries; the example class is Horn clauses. In order to demonstrate 
a reduction technique, we present a slightly inefficient interface which is nothing more than 
polynomial time transformations to and from the examples suitable for the Horn sentence 
learning algorithm of figure 2.1 or figure 2.2. Here, HORN refers to either algorithm. 

The interface is accomplished by constructing algorithms that simulate the oracles for mem- 
bership and equivalence needed by the HORN algorithm. The constructed oracles are permit- 
ted to use entailed example oracles for membership and equivalence, but otherwise must run in 
polynomial time. 

The constructed oracle simulator for the membership queries asked by HORN will be designated
$\in_{std}$, and the constructed oracle simulator for equivalence queries asked by HORN will
be designated $=_{std}$. The entailed example oracle for membership and the entailed example
oracle for equivalence used by $\in_{std}$ and $=_{std}$ are designated $\in_{ent}$ and $=_{ent}$,
respectively. The $\in_{std}$ oracle simulator is given in Figure 3.1 and the $=_{std}$ oracle
simulator is given in Figure 3.2.

We now argue that the answers and examples returned to HORN by these oracles are correct.
We note by convention that "$\exists v \notin A$" permits $v$ to be the logical constant F. We
also note that we are tacitly assuming that we know the names of all the variables used in the
hidden concept. By assuming that the variable names are not allowed to change once our learning
begins, the oracles constructed can simply restart the HORN algorithm with an updated set of
variable names whenever the $\in_{ent}$ or $=_{ent}$ oracles mention a new variable name. Clearly,
the HORN algorithm will need to be restarted at most once for each variable name in the hidden
concept.

5 The teacher could be using Horn sentences as examples - a Horn sentence is entailed if and only if every
one of its Horn clauses is entailed. Using membership queries about clauses (which can be viewed as degenerate
sentences), a Horn clause not entailed by the target can be efficiently extracted from a Horn sentence not
entailed by the target.

$\in_{std}(x)$

1  Let $A$ be the variables set to true in $x$
2  if $\exists v \notin A$ such that $\in_{ent}(A \to v)$
3      then return "no"
4      else return "yes"

Figure 3.1: Membership oracle for HORN. 
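
The membership oracle of Figure 3.1 can be sketched in Python as follows. This is an
illustrative sketch only: the factory `make_member_std`, the dictionary encoding of an
assignment, the use of `None` for the constant F, and the toy teacher `toy_member_ent` (which
answers for the hypothetical one-clause target a -> b) are all assumptions, not part of the
thesis.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Assumed representation: (antecedent, consequent); a consequent of None stands for F.
Clause = Tuple[FrozenSet[str], Optional[str]]


def make_member_std(
    member_ent: Callable[[Clause], bool], variables: List[str]
) -> Callable[[Dict[str, bool]], bool]:
    """Build a standard membership oracle from an entailment membership oracle."""

    def member_std(x: Dict[str, bool]) -> bool:
        a = frozenset(v for v in variables if x[v])  # line 1: variables true in x
        candidates: List[Optional[str]] = [v for v in variables if v not in a]
        candidates.append(None)  # by convention v may also be the constant F
        for v in candidates:  # line 2: is some clause A -> v entailed?
            if member_ent((a, v)):
                return False  # line 3: "no", x falsifies the target
        return True  # line 4: "yes", x satisfies the target

    return member_std


def toy_member_ent(clause: Clause) -> bool:
    """Hypothetical entailment teacher for the target consisting of the clause a -> b."""
    antecedent, consequent = clause
    return consequent in antecedent or ("a" in antecedent and consequent == "b")


member_std = make_member_std(toy_member_ent, ["a", "b", "c"])
```

Each call to `member_std` makes at most one entailment query per variable outside A plus one
for F, which matches the polynomial bound claimed for the oracle.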



$=_{std}(H)$

 1  if $=_{ent}(H)$ = "yes" then return "yes"
 2  Let $C$ be the clause returned by $=_{ent}(H)$
 3  Let $A$ be the set of variables in the antecedent of $C$
 4  if $C$ is a positive example
 5  then
 6      until $A$ stops changing
 7          if $\exists v \notin A$ such that $H \models A \to v$ then $A = A \cup \{v\}$
 8      Let $x$ be the example formed by setting all the variables in $A$ to T and all
        variables not in $A$ to F
 9      return $x$ as a negative example
10  else
11      until $A$ stops changing
12          if $\exists v \notin A$ such that $\in_{ent}(A \to v)$ then $A = A \cup \{v\}$
13      Let $x$ be the example formed by setting all the variables in $A$ to T and all
        variables not in $A$ to F
14      return $x$ as a positive example

Figure 3.2: Equivalence oracle for HORN. 
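
The equivalence oracle of Figure 3.2 can be sketched along the same lines. The sketch below is
an assumption-laden illustration: `equal_ent` is taken to return `None` for "yes" or a pair
`(clause, is_positive)`, clauses are `(frozenset, consequent)` pairs with `None` for F, and the
helper names `entails`, `close_under`, and `equal_std` are hypothetical. The closure loops only
add proper variables, since deriving F cannot happen for a genuine counterexample.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional, Tuple

# Assumed representation: (antecedent, consequent); a consequent of None stands for F.
Clause = Tuple[FrozenSet[str], Optional[str]]


def entails(sentence: Iterable[Clause], clause: Clause) -> bool:
    """Decide sentence |= clause by forward chaining from the clause's antecedent."""
    clauses = list(sentence)
    antecedent, consequent = clause
    derived = set(antecedent)
    changed = True
    while changed:
        changed = False
        for ant, cons in clauses:
            if ant <= derived:
                if cons is None:
                    return True  # antecedent is inconsistent with the sentence
                if cons not in derived:
                    derived.add(cons)
                    changed = True
    return consequent in derived


def close_under(start: Iterable[str], variables: List[str],
                holds: Callable[[Clause], bool]) -> set:
    """Lines 6-7 and 11-12: grow A until no clause A -> v with v outside A holds."""
    a = set(start)
    changed = True
    while changed:
        changed = False
        for v in variables:
            if v not in a and holds((frozenset(a), v)):
                a.add(v)
                changed = True
    return a


def equal_std(hypothesis, equal_ent, member_ent, variables):
    """Return None if equivalent, else (assignment x, label of x under the target)."""
    answer = equal_ent(hypothesis)  # lines 1-2
    if answer is None:
        return None
    (antecedent, _consequent), is_positive = answer  # line 3
    if is_positive:  # entailed by the target but not by the hypothesis
        a = close_under(antecedent, variables, lambda c: entails(hypothesis, c))
    else:  # entailed by the hypothesis but not by the target
        a = close_under(antecedent, variables, member_ent)
    x: Dict[str, bool] = {v: v in a for v in variables}  # lines 8 and 13
    return x, (not is_positive)  # lines 9 and 14: the label flips
```

A positive clause counterexample thus becomes a negative assignment example and vice versa,
which is the conversion Lemma 30 argues to be correct.
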

For purposes of exposition, we routinely treat a complete monomial and the unique variable 
assignment which satisfies it interchangeably. Likewise, we treat a set of variables and the 
conjunction of those variables interchangeably. 

Lemma 29 The oracle $\in_{std}$ correctly answers, in time polynomial in the number of variable
names in the hidden concept, any membership query asked by HORN.

Proof: Clearly $\in_{std}$ runs in polynomial time. We must now show that the answer it provides
is correct.

The example queried by HORN is a negative example if and only if it falsifies some clause
of the hidden concept. An example falsifies a clause, c, of the hidden concept if and only if the
example sets every variable in the antecedent of c to T and sets the consequent of c to F. But
in this case, the clause whose antecedent is the set of variables set to T by the example and
whose consequent is the consequent of c (which cannot be among the variables set T by the
example) is subsumed 6 by c and is therefore entailed by the hidden target. Necessarily, among
the $\in_{ent}$ queries asked by $\in_{std}$ is this particular clause. Conversely, if some query
$\in_{ent}(A \to v)$ with $v \notin A$ is answered "yes", then the example falsifies a clause
entailed by the hidden target and is therefore a negative example. □

Lemma 30 The oracle $=_{std}$ correctly answers, in time polynomial in the number of variable
names in the hidden concept and in time polynomial in the length of the presented concept, any
equivalence query asked by HORN.

Proof: Clearly, the first line of $=_{std}$ correctly determines whether the presented concept and
the hidden concept are logically equivalent. However, this is not enough. We must also show
that any example returned to HORN by $=_{std}$ is an example on which the presented concept
and the hidden concept disagree.

If the clause C returned by the $=_{ent}$ query is a positive example, then it is entailed by the
hidden concept and it is not entailed by the presented concept. Thus, step 7 cannot cause the
consequent of C to be added to A. However, A does contain the antecedent of C. Thus the
example x returned to HORN falsifies C and therefore falsifies the hidden concept. At the same
time the construction in step 7 guarantees that x satisfies the presented concept. This is because

6 The meaning of "is subsumed by" used here comes from automated theorem proving; the specific meaning
here in this propositional setting is "is formed from a set of literals that is a subset of the literals of". Chapter 7,
consistent with the terminology of description logics, uses the phrase differently.

step 7 finds the unique set of the variables that H demands be satisfied when the variables in
the antecedent of C are satisfied; if H is falsified when exactly this constructed set of variables
is satisfied, then $H \models C$, contradicting the fact that C was a positive counterexample for H.
Also note that step 7 runs in time polynomial in the size of the presented concept because for any
Horn sentence H and any Horn clause A, $H \models A$ can be answered in time polynomial in the
sizes of H and A. At most $O(n^2)$ such entailment questions are asked in this step, where n is
the number of variable names in the hidden concept.

Similarly, if the clause C returned by the $=_{ent}$ query is a negative example, then it is
entailed by the presented concept and is not entailed by the target concept. Step 11 constructs
an example which satisfies the hidden concept and falsifies the presented concept. The only
difference between this step and step 6 is that we cannot answer the series of entailment questions
ourselves, so we use the $\in_{ent}$ oracle to obtain the answers to the required entailment queries.

□ 

3.3.2 Learning Directly 

Next we present an algorithm that does not rely on the existence of an algorithm that learns from 
a teacher who provides the standard variable assignment examples. Throughout, $\alpha$ (and any
variation of $\alpha$) is a (possibly empty) conjunction of propositional variables. We will
frequently treat $\alpha$ as a set; in the case that $\alpha$ is empty, its conjunction represents
the logical constant T. Throughout the paper, $\beta$ (and any variation of $\beta$) represents a
set of at most one propositional variable; in the case that $\beta$ is empty, it is understood to
represent the logical constant F. We represent propositional Horn clauses with the schema
$\alpha \to \beta$; we refer to $\alpha$ as the antecedent of the clause and to $\beta$ as the
consequent.

We use the boldface $\boldsymbol{\in}$ to denote a membership query and the non-boldface $\in$ for
normal set membership. As is standard, we use $\models$ to denote logical entailment and use
$\cap$ and $\subset$ for set intersection and proper subset, respectively. The algorithm
entailHORN is shown in figure 3.3.

Two crucial properties of entailHORN are (1) that entailHORN keeps a monotonically 
growing sequence of antecedents approxAnts that provide clues about distinct clauses of the 
target and (2) that entailHORN provides increasingly better approximations as to the identity 
of these clauses. The current hypothesis hypoth is constructed from the antecedents comprising 
approxAnts by asking for each of these antecedents which consequents are supported by the 
target (line 8). Upon receiving an equivalence query counterexample C (line 3), entailHORN 
computes the (unique) minimal model of hypoth satisfying the antecedent of C (line 4) and 
tries to use this minimal model to chip away excess variables in some antecedent in approxAnts 
(line 6). If this attempt fails, entailHORN appends this minimal model to approxAnts as the 
initial approximation to the antecedent of some clause of the target for which no information 
(in the form of a counterexample) has yet been provided (line 7). 

Before arguing that entailHORN is correct and runs efficiently, we need the following sup- 
porting definitions and lemmas. 

Definition 31 If $\phi$ is a satisfiable formula, $\psi$ is a falsifiable formula, and
$\phi \models \psi$, then $\phi$ is said to entail $\psi$ non-trivially.

Next we present a series of lemmas which will enable us to prove the correctness of entailHORN.

Lemma 32 If $(\alpha \to \beta) \models (\alpha' \to \beta')$ non-trivially, then
$\alpha \subseteq \alpha'$ and $\beta \subseteq \beta'$.

Proof: If $\alpha \not\subseteq \alpha'$, then set all variables in $\alpha'$ to T and all other
variables to F. Since there is some variable in $\alpha - \alpha'$ that is set F, this setting
satisfies $\alpha \to \beta$. Since $\alpha' \to \beta'$ is falsifiable, this assignment falsifies
it. But then $(\alpha \to \beta) \not\models (\alpha' \to \beta')$, a contradiction.

Similarly, if $\beta \not\subseteq \beta'$, then assign F to those variables in $\beta'$ and T to
all other variables. Because $\alpha' \to \beta'$ is falsifiable, this assignment falsifies it.
However, since $\beta \not\subseteq \beta'$, there is some variable in $\beta$ that is assigned T
so that $\alpha \to \beta$ is satisfied, a contradiction. □

Lemma 33 Let T be a Horn sentence. If $T \models (\alpha \to \beta)$ non-trivially, then $\alpha$
is a superset of the antecedent of some clause of T.

Proof: If T contains some clause with an empty antecedent, then the lemma holds vacuously.

For the case when T contains no clauses with empty antecedent, suppose by way of contradiction
that the lemma does not hold. Assign T to all variables in $\alpha$ and F to all other
variables. (Since $\alpha \to \beta$ is falsifiable, this assignment falsifies it.) Now notice that
since by supposition no antecedent in T is contained in $\alpha$, at least one variable in each
antecedent of T is F, thus every antecedent in T is F, and so T must be satisfied. Therefore
$T \not\models \alpha \to \beta$, a contradiction. □

Naming the hidden target Horn theory T*, we will now argue correctness.

entailHORN

1  $\mathit{approxAnts} = \emptyset$
2  $\mathit{hypoth} = \emptyset$
3  while hypoth is not equivalent to the target, let $C$ be the counterexample returned
   by the equivalence oracle
4      $\hat{\alpha} = \{\beta : \mathit{hypoth} \models (ant(C) \to \beta)\}$
5      if $\exists \alpha \in \mathit{approxAnts}$ such that
           • $\alpha \cap \hat{\alpha} \subset \alpha$
           • $\mathrm{RHS}(\alpha \cap \hat{\alpha}) \neq \emptyset$
6      then replace the first such $\alpha$ in approxAnts by $\alpha \cap \hat{\alpha}$
7      else append $\hat{\alpha}$ to approxAnts
8      $\mathit{hypoth} = \{\alpha \to \beta : \alpha \in \mathit{approxAnts},\ \beta \in \mathrm{RHS}(\alpha)\}$
9  return hypoth

Figure 3.3: The algorithm entailHORN used to learn Horn sentences.

$\mathrm{RHS}(\alpha)$

1  return $\{v : v \notin \alpha,\ \boldsymbol{\in}(\alpha \to v)\}$ $^a$

$^a$ It is permissible for $v$ to be the logical constant F.

Figure 3.4: Find all consequents of entailed clauses having a given antecedent. 
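
For readers who want to experiment, the following Python sketch follows the structure of
Figures 3.3 and 3.4 on a toy target. It is an illustration under stated assumptions, not the
thesis's implementation: clauses are `(frozenset, consequent)` pairs with `None` for F, the
teacher built by `make_teacher` answers equivalence queries by brute-force enumeration of all
clauses (feasible only for a handful of variables), and all function names are hypothetical.

```python
from itertools import combinations


def entails(sentence, clause):
    """Decide sentence |= clause by forward chaining from the clause's antecedent."""
    antecedent, consequent = clause
    derived, changed = set(antecedent), True
    while changed:
        changed = False
        for ant, cons in sentence:
            if ant <= derived:
                if cons is None:
                    return True  # antecedent is inconsistent with the sentence
                if cons not in derived:
                    derived.add(cons)
                    changed = True
    return consequent in derived


def all_clauses(variables):
    """Enumerate every Horn clause whose consequent is outside its antecedent."""
    for size in range(len(variables) + 1):
        for ant in combinations(variables, size):
            for cons in [v for v in variables if v not in ant] + [None]:
                yield frozenset(ant), cons


def make_teacher(target, variables):
    """Hypothetical brute-force teacher for membership and equivalence queries."""
    def member(clause):
        return entails(target, clause)

    def equivalent(hypoth):
        for clause in all_clauses(variables):
            if entails(target, clause) != entails(hypoth, clause):
                return clause  # entailed by exactly one of target and hypothesis
        return None

    return member, equivalent


def entail_horn(variables, member, equivalent):
    def rhs(alpha):  # Figure 3.4; None plays the role of the constant F
        return [v for v in list(variables) + [None] if v not in alpha and member((alpha, v))]

    approx_ants, hypoth = [], []  # lines 1-2
    while True:
        c = equivalent(hypoth)  # line 3
        if c is None:
            return hypoth  # line 9
        ant = c[0]
        alpha_hat = frozenset(v for v in variables if entails(hypoth, (ant, v)))  # line 4
        for i, alpha in enumerate(approx_ants):  # line 5
            meet = alpha & alpha_hat
            if meet < alpha and rhs(meet):
                approx_ants[i] = meet  # line 6: refine the first suitable antecedent
                break
        else:
            approx_ants.append(alpha_hat)  # line 7
        hypoth = [(alpha, beta) for alpha in approx_ants for beta in rhs(alpha)]  # line 8


variables = ["a", "b", "c", "d"]
target = [(frozenset({"a"}), "b"), (frozenset({"b", "c"}), "d"), (frozenset({"a", "d"}), None)]
member, equivalent = make_teacher(target, variables)
learned = entail_horn(variables, member, equivalent)
```

The learned sentence need not be syntactically identical to the target; the guarantee is
logical equivalence, which the brute-force teacher checks clause by clause.
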

Lemma 34 At all times during the run of entailHORN, $T_* \models \mathit{hypoth}$.

Proof: The subroutine RHS is used to explicitly check that each clause of hypoth is entailed
by T*. A clause $\alpha \to \beta$ is placed in hypoth only if $\beta \in \mathrm{RHS}(\alpha)$
(line 8). This can only occur if $\boldsymbol{\in}(\alpha \to \beta)$, hence
$T_* \models (\alpha \to \beta)$. Since T* entails every clause of hypoth,
$T_* \models \mathit{hypoth}$. □

Corollary 35 Any counterexample to hypoth with respect to T* must be entailed by T* but not
by hypoth.

Notice that we can easily determine whether the target is unsatisfiable by first making an 
equivalence query on an unsatisfiable Horn sentence, so for expository purposes we henceforth 
assume that the target is satisfiable. Also, if the target is a tautology then the first equivalence 
query made by entailHORN will be correct, so we henceforth also assume that the target is 
falsifiable. Finally, under these assumptions, since the counterexamples obtained by equivalence 
queries (line 3) are entailed by exactly one of two satisfiable Horn sentences, counterexample 
clauses are neither unfalsifiable nor unsatisfiable. 

Lemma 36 If $\alpha \to \beta$ is any counterexample of hypoth with respect to T*, then there is
some clause $\alpha_* \to \beta_*$ of T* such that $\alpha_* \subseteq \alpha$ and
$\beta_* \notin \alpha$.

Proof: By Corollary 35, $T_* \models (\alpha \to \beta)$ and
$\mathit{hypoth} \not\models (\alpha \to \beta)$. Because
$\mathit{hypoth} \not\models (\alpha \to \beta)$, $\alpha \to \beta$ must be falsifiable. Consider
the variable assignment that makes $\alpha$ T and makes all other variables F; this assignment
falsifies $\alpha \to \beta$. Since $T_* \models \alpha \to \beta$ this assignment must falsify T*.
But T* is falsified if and only if some clause $\alpha_* \to \beta_*$ of T* is falsified. But this
assignment falsifies $\alpha_* \to \beta_*$ if and only if $\alpha_* \subseteq \alpha$ and
$\beta_* \notin \alpha$. □

The following lemma shows that no matter what counterexample entailHORN receives from 
the equivalence oracle, the antecedent that entailHORN actually uses to modify approxAnts is 
the antecedent of a (possibly different) counterexample. 

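Lemma 37 If $\alpha \to \beta$ is a counterexample returned to entailHORN by the equivalence query
oracle in line 3, then the $\hat{\alpha}$ computed in line 4 is such that $\hat{\alpha} \to \beta$
is also a counterexample.

Proof: By Corollary 35, $T_* \models (\alpha \to \beta)$ and
$\mathit{hypoth} \not\models (\alpha \to \beta)$. Clearly $\hat{\alpha}$ contains $\alpha$, thus
$T_* \models (\hat{\alpha} \to \beta)$. We now show that if
$\mathit{hypoth} \models (\hat{\alpha} \to \beta)$ then $\mathit{hypoth} \models (\alpha \to \beta)$.

Suppose that $\mathit{hypoth} \models (\hat{\alpha} \to \beta)$. Let e be any variable assignment
that satisfies hypoth. If e does not satisfy $\alpha$, then e satisfies $\alpha \to \beta$. On the
other hand, if e does satisfy $\alpha$ then, since e satisfies hypoth, by construction e also
satisfies $\hat{\alpha}$. Since $\mathit{hypoth} \models (\hat{\alpha} \to \beta)$ and since e
satisfies hypoth and $\hat{\alpha}$, e must satisfy $\beta$. But then e satisfies
$\alpha \to \beta$. Thus $\mathit{hypoth} \models (\alpha \to \beta)$.

But $\mathit{hypoth} \not\models (\alpha \to \beta)$. Thus
$\mathit{hypoth} \not\models (\hat{\alpha} \to \beta)$. Therefore, $\hat{\alpha} \to \beta$ is a
counterexample to hypoth with respect to T*. □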

We want to show that hypoth is becoming increasingly closer to T*; to do this we have the
following definition which will help describe how far hypoth is from T*.

Definition 38 For $\alpha \in \mathit{approxAnts}$, we say that $\alpha$ properly covers a clause
$\alpha_* \to \beta_*$ if $\alpha_* \subseteq \alpha$ and $\beta_* \notin \alpha$.
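
Definition 38 translates directly into a set test. The short Python sketch below is illustrative
only: the function name `properly_covers`, the clause encoding (`None` for the constant F), and
the sample values of `target` and `approx_ants` are hypothetical and chosen by hand.

```python
def properly_covers(alpha, clause):
    """alpha properly covers (antecedent -> consequent): antecedent is a subset of alpha
    and the consequent lies outside alpha (the constant F, encoded as None, always does)."""
    antecedent, consequent = clause
    return antecedent <= alpha and consequent not in alpha


# Hypothetical target clauses: a -> b, (b and c) -> d, (a and d) -> F.
target = [
    (frozenset({"a"}), "b"),
    (frozenset({"b", "c"}), "d"),
    (frozenset({"a", "d"}), None),
]
# Hypothetical contents of approxAnts at some point in a run.
approx_ants = [frozenset({"a"}), frozenset({"b", "c"}), frozenset({"a", "b", "d"})]

# For each antecedent, the indices of the target clauses it properly covers.
covered = [
    [i for i, clause in enumerate(target) if properly_covers(alpha, clause)]
    for alpha in approx_ants
]
```

In this example each antecedent properly covers a different target clause, which is the
invariant the main lemma below establishes in general.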

We are now prepared to state and prove our main lemma. 

Lemma 39 If $\alpha_1$ and $\alpha_2$ are antecedents in distinct positions of the sequence
approxAnts, then $\alpha_1$ and $\alpha_2$ each properly cover some clause of T*, but $\alpha_1$
and $\alpha_2$ do not properly cover the same clause of T*.

Proof: First observe that for every $\alpha$ in approxAnts there is some $\beta$ such that
$T_* \models (\alpha \to \beta)$. This is because line 6 replaces an element of approxAnts only if
the replacement, $\alpha \cap \hat{\alpha}$, is such that
$T_* \models (\alpha \cap \hat{\alpha}) \to \beta$ for some $\beta$. On the other hand, line 7
simply appends the antecedent of the (possibly implicit (Lemma 37)) counterexample constructed in
line 3. But by Corollary 35, this counterexample is entailed by T*. Since every $\alpha$, when
placed in approxAnts, was a counterexample to hypoth, every $\alpha$ is the antecedent of some
clause that is non-trivially entailed by T*. Thus, by Lemma 33, every $\alpha$ in approxAnts
properly covers some clause of T*. Hence $\alpha_1$ properly covers some clause of T* and
$\alpha_2$ properly covers some clause of T*.

We now show that the clauses properly covered by $\alpha_1$ and $\alpha_2$ are different. Suppose
not and consider the first iteration of the main loop in which this property fails to hold. Let
$\alpha_1$ and $\alpha_2$ be two elements of approxAnts that properly cover the same clause
$\alpha_* \to \beta_*$ of T*. Without loss of generality, assume that this iteration of the main
loop caused $\alpha_2$ to appear in approxAnts. There are two ways in which this iteration could
place $\alpha_2$ in approxAnts: either $\alpha_2$ was appended to approxAnts or $\alpha_2$ was the
result of intersecting $\hat{\alpha}$ with some $\alpha_2'$ that was in approxAnts in the previous
iteration of the loop.



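In the former case, since $\alpha_1$ and $\alpha_2$ (which is actually $\hat{\alpha}$) both
properly cover $\alpha_* \to \beta_*$, $\alpha_1 \cap \hat{\alpha}$ properly covers
$\alpha_* \to \beta_*$, thus $\beta_* \in \mathrm{RHS}(\alpha_1 \cap \hat{\alpha})$ and so
$\mathrm{RHS}(\alpha_1 \cap \hat{\alpha}) \neq \emptyset$. Thus, for entailHORN not to have used
$\hat{\alpha}$ to replace $\alpha_1$ in approxAnts, it must be that
$\alpha_1 \cap \hat{\alpha} = \alpha_1$. But, since $\alpha_1$ properly covers
$\alpha_* \to \beta_*$, $\beta_* \in \mathrm{RHS}(\alpha_1)$, and since $\alpha_1$ was in
approxAnts when hypoth was constructed for this iteration, hypoth contains
$\alpha_1 \to \beta_*$. Thus $\beta_* \in \hat{\alpha}$, contradicting the assumption that
$\hat{\alpha}$ properly covers $\alpha_* \to \beta_*$. Thus, $\hat{\alpha}$ could not have been
appended to approxAnts.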

In the latter case, where $\hat{\alpha}$ was used to refine some antecedent already in approxAnts,
we have that $\alpha_2 = \hat{\alpha} \cap \alpha_2'$ where $\alpha_2'$ is some element of
approxAnts in the preceding iteration of the loop. Either $\alpha_1$ precedes $\alpha_2$ in
approxAnts, or $\alpha_1$ follows $\alpha_2$ in approxAnts.

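Consider the case in which $\alpha_1$ precedes $\alpha_2$ in approxAnts. If
$\alpha_2' \cap \hat{\alpha}$ properly covers $\alpha_* \to \beta_*$, then either $\alpha_2'$
properly covers $\alpha_* \to \beta_*$ or $\hat{\alpha}$ properly covers $\alpha_* \to \beta_*$.
But if $\alpha_2'$ properly covered $\alpha_* \to \beta_*$, then the inductive hypothesis is
violated because $\alpha_1$ and $\alpha_2'$ properly covered the same clause in the previous
iteration. Thus, it must be the case that $\hat{\alpha}$ properly covers $\alpha_* \to \beta_*$.
Since entailHORN did not use $\hat{\alpha}$ to refine $\alpha_1$, it must be the case that
$\hat{\alpha} \cap \alpha_1 = \alpha_1$. But, in this case $\alpha_1 \subseteq \hat{\alpha}$ and
since $\alpha_1$ properly covers $\alpha_* \to \beta_*$, then
$\beta_* \in \mathrm{RHS}(\alpha_1)$, and since $\alpha_1$ was in approxAnts when hypoth was
formed for this iteration, $\beta_* \in \hat{\alpha}$, contradicting the assumption that
$\hat{\alpha}$ properly covers $\alpha_* \to \beta_*$.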

Last consider the case where $\alpha_1$ follows $\alpha_2$ in approxAnts, in which case entailHORN
did not try to refine $\alpha_1$ with $\hat{\alpha}$. If $\alpha_2' \cap \hat{\alpha}$ properly
covers $\alpha_* \to \beta_*$ then either $\alpha_2'$ properly covered $\alpha_* \to \beta_*$ or
$\hat{\alpha}$ properly covered $\alpha_* \to \beta_*$. If $\alpha_2'$ properly covered
$\alpha_* \to \beta_*$, then the inductive hypothesis is violated because in the previous
iteration $\alpha_1$ and $\alpha_2'$ properly covered the same clause. Thus it must be the case
that $\hat{\alpha}$ properly covers $\alpha_* \to \beta_*$, that $\alpha_* \subseteq \alpha_2'$,
and that $\beta_* \in \alpha_2'$; otherwise, $\alpha_2' \cap \hat{\alpha}$ could not properly
cover $\alpha_* \to \beta_*$ and yet have $\alpha_2'$ not properly cover $\alpha_* \to \beta_*$.
Now by assumption $\alpha_1$ properly covers $\alpha_* \to \beta_*$. Consider all antecedents
used by entailHORN to create or refine (by intersection) the position in approxAnts currently
occupied by $\alpha_1$. Since $\alpha_* \subseteq \alpha_1$, every antecedent used by entailHORN
in the position currently occupied by $\alpha_1$ must have contained $\alpha_*$. Further, since
$\alpha_1$ does not contain $\beta_*$, one of these antecedents must not have contained
$\beta_*$. Call the first such antecedent $\hat{\alpha}'$. Now observe that since $\alpha_2'$
contains both $\alpha_*$ and $\beta_*$, we know that every antecedent that has occupied the
position currently occupied by $\alpha_2$ must have contained $\alpha_*$ and $\beta_*$. But then
for every such predecessor, $\alpha_2''$, of $\alpha_2$ we have that
$\alpha_2'' \cap \hat{\alpha}'$ is a proper subset of $\alpha_2''$ because $\alpha_2''$ contains
$\beta_*$ but $\hat{\alpha}'$ does not. Further, because $\alpha_2'' \cap \hat{\alpha}'$ properly
covers $\alpha_* \to \beta_*$, we have that
$\beta_* \in \mathrm{RHS}(\alpha_2'' \cap \hat{\alpha}')$ and therefore
$\mathrm{RHS}(\alpha_2'' \cap \hat{\alpha}')$ is non-empty. This contradicts the assumption that
$\hat{\alpha}'$ was used to refine the position currently occupied by $\alpha_1$.

Thus the antecedents in approxAnts properly cover distinct clauses of T*. □

We now state our main theorem. 

Theorem 40 Let V be a finite set of propositional variables. The class of Horn formulas using 
only variables from V is polynomial time learnable from entailed examples using equivalence 
and membership queries. 

Proof: It is easily verified that each step in the while loop of entailHORN takes time at most
polynomial in $|V|$ and the size of the unknown Horn formula. We need only argue that there are
only polynomially many iterations of the loop.

By Lemma 39, every antecedent in approxAnts properly covers some clause of T* and
no two antecedents in approxAnts properly cover the same clause of T*. Thus approxAnts
never contains more elements than there are clauses in T*. Furthermore, entailHORN never
deletes any element of approxAnts; at worst it replaces some element of approxAnts by a
proper subset of that element. For each element, this can happen at most as many times as there
are variables in V. □
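
The counting argument in the proof can be written out as explicit bounds. The following is a
rough worked estimate rather than a bound stated in the thesis: $m$ (the number of clauses of
T*) and $n = |V|$ are notation introduced here, and the membership-query count assumes that
every call to RHS is recomputed from scratch with at most $n + 1$ queries.

```latex
\begin{align*}
\#\text{iterations}
  &\le \underbrace{m}_{\text{appends (line 7)}}
     + \underbrace{m \cdot n}_{\text{refinements (line 6)}}
   = m(n+1), \\
\#\text{membership queries per iteration}
  &\le \underbrace{m(n+1)}_{\text{line 5}} + \underbrace{m(n+1)}_{\text{line 8}}
   = 2m(n+1), \\
\#\text{membership queries in total}
  &\le 2m^{2}(n+1)^{2}.
\end{align*}
```

Each iteration also makes one equivalence query, so at most $m(n+1) + 1$ equivalence queries
are made under the same assumptions.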

We immediately have the following corollaries. 

Corollary 41 Let V be a finite set of propositional variables. The class of Horn formulas using 
only variables from V is polynomial time learnable using examples and membership queries using 
Horn sentences as examples. 

Proof: (Sketch) At least one Horn clause of a counterexample Horn sentence must be suitable 
as a counterexample for entailHORN. □ 

Corollary 42 Let V be a finite set of propositional variables. The class of Horn formulas 
using only variables from V is polynomial time PAC learnable using random examples and 
membership queries using Horn clauses as examples. 

Corollary 43 Let V be a finite set of propositional variables. The class of Horn formulas 
using only variables from V is polynomial time PAC learnable using random examples and 
membership queries using Horn sentences as examples. 

3.3.3 Membership Queries are Necessary 

In contrast to Corollary 41, we show that equivalence queries do not wield sufficient power alone 
to allow exact learning of Horn sentences from entailment using Horn sentences as examples. 
We also give evidence that PAC learning is similarly crippled without membership queries. This 
shows that the membership queries used in the previous section are necessary. 

Before presenting the result, we define a monotone negative CNF formula as a conjunction 
of clauses containing only negated literals. To show the necessity of using membership queries, 
we start with the following lemma. The theorem following the lemma is the result we seek in 
this section. 

Lemma 44 There exists an exact learning algorithm for Horn sentences that uses equivalence 
queries to an entailed example teacher using Horn sentences as examples, only if there exists 
an exact learning algorithm for monotone negative CNF formulas that uses equivalence queries 
to a standard teacher. 

Proof: Any assignment to a set of propositional variables can be encoded as a conjunction
of literals over the variables - if the variable $x_i$ is to be set T then place $x_i$ in the
conjunction, otherwise place $\neg x_i$ in the conjunction. Clearly only the desired assignment
satisfies this conjunction. Now observe that every monotone negative CNF formula is a Horn
sentence, and assume that A is an algorithm that learns Horn sentences from entailment using only
equivalence queries. Let $C_*$ be the target monotone negative CNF formula.

When A makes an equivalence query for some hypothesis $H'$, form $H$ by deleting any
positive literals 7 in $H'$ and make an equivalence query against $C_*$. If the query is answered
"yes" then we are done because $H$ is logically equivalent to $C_*$. If a positive counterexample
$e_p$ is returned (which by definition falsifies $H$, and therefore satisfies all the negated
variables of one of its clauses), then the monotonicity of $C_*$ guarantees that changing any
variable setting in $e_p$ from T to F results in a positive example of $C_*$. So, if $e_p$
satisfies $H'$, $e_p$ must make T

' If this produces a. clause with no literals, then take tl to be an unsatisfiable formula. 



56 

6B 



any unnegated variables appearing in any clause in which the remaining (negated) variables 
are made true; obtain e f p by setting any such variable to F. Then e' p falsifies H 1 and H so that 
like e p , e p is also (positive) counterexample to H against C*. On the other hand, if a negative 
counterexample e n is returned (which by definition satisfies i7), since H was obtained by simply 
deleting literals from (already satisfied) clauses of H\ e n satisfies H i \ 

It now follows that returning the counterexample ($e_n$, $e_p$, or $e'_p$ as appropriate), encoded 
as a conjunction of literals satisfied only by the counterexample, produces a learning algorithm 
for the class of monotone negative CNF formulas. □ 
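
The reduction in the proof can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the thesis's own code: the clause representation and the names `strip_positive`, `repair_positive`, and `encode` are hypothetical, only the first clause whose negated variables are all T is repaired, and no clause is assumed to mention a variable both negated and unnegated.

```python
from typing import Dict, FrozenSet, List, Tuple

# A clause is (negated variables, unnegated variables); a formula is a list of clauses.
Clause = Tuple[FrozenSet[str], FrozenSet[str]]
Assignment = Dict[str, bool]


def clause_true(clause: Clause, e: Assignment) -> bool:
    neg, pos = clause
    return any(not e[v] for v in neg) or any(e[v] for v in pos)


def satisfies(e: Assignment, formula: List[Clause]) -> bool:
    return all(clause_true(c, e) for c in formula)


def strip_positive(h_prime: List[Clause]) -> List[Clause]:
    """Form H from H' by deleting every positive literal.

    A clause left with no literals is the empty clause, which no assignment
    satisfies, so H is then unsatisfiable.
    """
    return [(neg, frozenset()) for neg, _ in h_prime]


def repair_positive(e_p: Assignment, h_prime: List[Clause]) -> Assignment:
    """Turn a positive counterexample to H into one that also falsifies H'.

    Sketch assumption: only the first clause of H' whose negated variables
    are all T is repaired.
    """
    for neg, pos in h_prime:
        if all(e_p[v] for v in neg):
            repaired = dict(e_p)
            for v in pos:
                repaired[v] = False  # T -> F keeps a monotone negative target satisfied
            return repaired
    raise ValueError("e_p does not falsify the stripped hypothesis H")


def encode(e: Assignment) -> List[Tuple[str, bool]]:
    """Encode an assignment as a conjunction of literals (variable, polarity)."""
    return sorted(e.items())


# Hypothetical example: target C* = (not a or not c), hypothesis H' = (a -> b).
target = [(frozenset({"a", "c"}), frozenset())]
h_prime = [(frozenset({"a"}), frozenset({"b"}))]
h = strip_positive(h_prime)
e_p = {"a": True, "b": True, "c": False}
e_p_fixed = repair_positive(e_p, h_prime)
print(encode(e_p_fixed))
```

In the example, the original counterexample satisfies $H'$, while the repaired one falsifies both $H$ and $H'$ and still satisfies the target.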

Angluin [6] has shown that monotone DNF formulas (DNF formulas that have only 
unnegated variables) are not learnable from equivalence queries alone, even when the learner is 
allowed to make queries about arbitrary DNF formulas. Our theorem now follows immediately. 

Theorem 45 Horn sentences cannot be exactly learned (with equivalence queries, but without 
membership queries) from an entailed example teacher using Horn sentences as examples. 

Proof Sketch: The dual of monotone negative CNF is monotone DNF; simply negating the 
label of an example provides the correct transformation. □ 

We next comment on PAC learning Horn sentences (without membership queries) from an 
entailed example teacher; the following corollary relates PAC entailed Horn learnability to 
standard DNF PAC learnability. The standard PAC learnability of DNF remains the central 
open problem in computational learning theory. 

Corollary 46 Learning Horn sentences with equivalence queries from an entailed example 
teacher using Horn sentences as examples is as hard as PAC learning arbitrary DNF formulas 
from a standard teacher. 

Proof Sketch: Because Kearns et al. [65] have shown that PAC learning DNF without 
membership queries is no easier than PAC learning monotone DNF without membership queries, 
the construction reducing the exact monotone DNF standard learning problem to the exact Horn 
entailment learning problem leading up to Theorem 45 suffices to reduce the PAC (without 
membership queries) monotone DNF standard learning problem to the PAC Horn entailment 
learning problem. Thus a PAC Horn entailment learning algorithm would produce a standard 
PAC DNF learning algorithm. □ 

3.4 Application to Approximate Entailment 

We now consider the application of these entailment results outside the field of computational 
learning theory by considering the problem of finding the Horn least upper bound of a known 
CNF theory, where the Horn least upper bound is defined to be the (unique) logically strongest 
Horn sentence entailed by the CNF theory. Even when the Horn least upper bound for some 
theory is small, current methods do not guarantee that a small least upper bound will be found 
without post-processing the set of entailed Horn clauses found with an algorithm such as the 
propositional Horn compression result of Corollary 24. Such an approach does not lessen the 
number of resolution proofs needed in the first place to obtain the set of entailed clauses. In 
contrast, the entailment algorithm suggested below will never approximate the actual Horn least 
upper bound by any Horn sentence having more distinct antecedents than the actual Horn least 
upper bound. This implies that our algorithm will never approximate the actual Horn least 
upper bound by a Horn sentence that has more than $n$ times as many clauses - and hence is 
more than $n^2$ times as large - as the actual Horn least upper bound, where $n$ is the number of 
variables in the theory. This suggests the approach of using entailHORN to find the Horn least 
upper bound, using the known CNF theory to answer membership and equivalence queries. 
However, although the final Horn sentence is guaranteed to be small, this approach, too, could 
result in far too many resolution applications. Instead, we consider a setting in which the 
CNF theory is manipulated only when necessary (we consider running entailHORN in a "lazy" 
manner), and the environment is used to provide counterexamples to the correctness of the 
currently hypothesized Horn least upper bound. We consider two settings in which such an 
approach seems plausible. 

Consider a setting in which an agent who happens to live near elephants is provided with 
some theory $\phi$ that describes the world. Our goal is to have the agent maneuver in the world 
without getting crushed. Among the ways in which our goal can be achieved are (1) have 
the agent cower in a corner predicting that everything will crush him, (2) in each situation 
determine from $\phi$ whether he will be crushed, or (3) tractably approximate $\phi$ by $\phi'$ and have 
him update $\phi'$ only when he observes (unpredicted by $\phi'$) something being crushed. 

Clearly solution (1) is uninteresting. Solution (2) potentially suffers from the uninteresting 
features of solution (1) because in the cases where $\phi$ is intractable, the agent will, in effect, 
spend most of his time in a corner predicting that he will be crushed while searching for a proof 
to the contrary. Solution (3) seems plausible in that the agent will spend much of his time 
exploring the world and only retreat to a corner to spend time manipulating $\phi$ when nature 
deems it necessary. We refer to the agent operating under solution (3) as a lazy agent. A bit 
more formally, 

Definition 47 Let $\phi$ be a (propositional) CNF theory, let $\Sigma = \{\sigma_1, \sigma_2, \ldots\}$ be a set of unlabeled 
clauses whose entailment by $\phi$ is to be predicted, and let $T = \{\upsilon_1, \upsilon_2, \ldots\}$ be a set of urgent 
clauses labeled according to entailment by $\phi$. Let $\xi_1, \xi_2, \ldots$ be a sequence defined over elements 
of $\Sigma \cup T$. A lazy agent is an agent who uses a prediction strategy that 

1. predicts the entailment of each $\xi_i$ "quickly", 

2. updates the prediction strategy only upon incorrectly predicting an element of $T$, and 

3. predicts that $\phi$ entails $\xi_i$ only if indeed $\phi \models \xi_i$. 

Intuitively, the longer lived lazy agent is the one who predicts that more things are entailed. 
The scare quotes around the word "quickly" in the definition are intended to bias the agent 
toward using some $\phi'$ that is a tractable (yet small relative to $\phi$) approximation to $\phi$, if such a 
$\phi'$ exists. Updating $\phi'$ only upon seeing (labeled) elements of $T$ is intended to capture the idea 
that only those predictions whose outcomes are observable in the world are important enough 
for the agent to spend time adjusting his prediction strategy to predict them correctly; it can 
be argued that the agent is concentrating on the facts that are relevant to the environment he 
is exploring. 

Given this definition and discussion, an agent who uses entailHORN to acquire the Horn 
least upper bound to $\phi$ and uses intermediate hypotheses hypoth to predict the entailment of $\xi_i$ 
seems a satisfactory lazy agent. Any clause entailed by the Horn least upper bound of $\phi$ must 
be entailed by some Horn clause entailed by the Horn least upper bound. It is an easy matter 
to test all of the (at most $n$) largest Horn clauses that entail the clause to see which ones are 
not entailed by hypoth and from among those use resolution on $\phi$ to find some new Horn clause 
to use as a counterexample for entailHORN to update hypoth. Further, because entailHORN 
is an exact learning algorithm, we can treat it as an on-line algorithm with bounded mistakes. 
This means that on Horn clauses of $T$ the agent will make at most a number of predictive 
errors polynomial in the size of the Horn least upper bound of $\phi$. This is a quantitative means 
of measuring the amount of time spent on resolution over $\phi$. 

As a second setting, consider the work by Greiner and Schuurmans [52], which assumes an 
agent who receives a sequence of randomly generated clauses. For each clause, the agent 
determines (via resolution on $\phi$) whether the clause is entailed, and if necessary, adjusts his 
approximation $\phi''$ to the Horn least upper bound $\phi'$, to correct his prediction. However, the 
agent constrains $\phi''$ to have size at most $k$, so that prediction remains efficient. To accomplish 
their goal, they define a set of transformations between Horn sentences of size at most $k$. In 
the end, they show that they find a $\phi''$ that is locally optimal with respect to the neighborhood 
defined by the transformations. Hence, they may fail to find a global optimum, even if there is 
one of size at most $k$. 

In contrast, we consider an agent who is given a set of randomly generated clauses, labeled 
by entailment according to $\phi$. Since our algorithm produces a sequence of Horn sentences of 
monotonically increasing size, we can easily determine whether there is any Horn sentence of 
size $k$ consistent with the sample. If the true Horn least upper bound has size at most $k$, or if 
the best size $k$ Horn least upper bound approximation is consistent with the sample, then our 
algorithm will find a comparably performing Horn sentence having size at most $kn^2$. In this 
setting too, we are able to quantitatively specify the resolution work we must do by noting that 
resolution over $\phi$ is required only to answer membership queries, and our algorithm will make 
at most a number of membership queries that is polynomial in $kn^2$ and the size of the sample. 

3.5 Discussion 

Using techniques very similar to Chapter 2, our work closes a question in computational learning 
theory left open by [2, 5]. The result is an algorithm that by asking natural questions is 
guaranteed to produce a Horn sentence consistent with a set of clauses labeled as to whether 
entailed by an unknown Horn sentence; furthermore, the algorithm is guaranteed to take time 
at most polynomial in the size of the smallest among all such consistent Horn sentences. 

This new algorithm is then applied to the problem of approximate entailment found in the 
field of machine learning. Here, this work adds to the approaches found in the literature. Our 
approach is to confine the appeals to resolution to only those places where our polynomial time 
learning algorithm specifically asks about the entailment of a specific Horn clause. Our goal 
is to accelerate the construction of a tractable Horn upper bound of a known CNF theory by 
providing a sharp focus for the resolution machinery. A side effect of our learning algorithm 
is that it implicitly maintains a monotonically increasing lower bound on the size of the Horn 
least upper bound of the CNF theory. 

We hope techniques from computational learning theory will continue to find fruitful 
application in the field of machine learning. For example, can efficient learning algorithms further 
limit the number of appeals to or further limit the focus of resolution in constructing a tractable, 
useful approximation to a given CNF theory? 

Chapter 4 

Membership by Subsumption 



What formal utility can be imparted to an expert's response time when presented with a 
question? Computer Science is replete with examples of caching for speed improvement. 
Computer memory hierarchies remember the most recently accessed data item. Operating 
systems remember the most recently accessed fragment of code. Caching is founded on the 
principle that what was useful recently will be useful shortly. 

We might imagine that a domain expert operates in a similar fashion; faced with a problem, 
the expert may start with those ideas that proved useful very recently, and in this way he can 
be seen as caching recent results for use in the near future. Notice that this view assumes that 
the expert does not reason from first principles right away - indeed, at any given moment his 
description of the world may be quite different than the shortest possible description. However, 
because the expert uses knowledge that has proven useful recently, a problem that is very similar 
to a recent problem can be solved quite quickly given the expert's current description of the 
world, whereas that same new problem might require significant effort to solve given only the 
smallest possible world description. Thus, the expert's current world description is related to 
which problems he can solve quickly - independent of his ability to articulate that description. 

In the previous two chapters, we have considered two notions of example labeling. We have 
seen variable assignment examples labeled according to whether the target is satisfied, and we 
have seen clause examples labeled according to entailment by the target. In this chapter we 
consider a third kind of example labeling that seeks to capture the efficiency of an expert in 
answering questions. Here we imagine that the expert has reached some fixed, cached 
representation of the world and he answers questions according to that representation. We explore 
the interaction of a learner with such an expert. Our goal is to learn a fixed snapshot of the 
expert's cache at some moment in time, believing that by doing so we are capturing not only 
a description of the world, but also the proven efficiency of the expert in reasoning about the 
world. 

Angluin [5] provided an algorithm for learning a restricted class of propositional Horn 
sentences which uses what she terms a request for hint query. The answer to this query tells the 
learner whether his chosen clause is subsumed by some clause of the target. In terms of an 
expert/apprentice interaction, we view this as the apprentice being told that his question is 
trivially answered by comparing the clause to the set of clauses in a proven, useful representation 
- that is, the expert can answer the question quickly (as opposed to taking time to construct a 
complicated proof). 

We present a learning algorithm that uses subsumption queries and equivalence queries. 
We then show that the algorithm is a polynomial time learning algorithm for any class of CNF 
formulas that is closed under projection, closed under clause deletion, and has a polynomial 
time solution to the satisfiability problem. To illustrate the utility of this generic algorithm, 
we demonstrate that it learns a class of formulas constructed by Boros et al. [27] that properly 
contains the class of Horn sentences and was not previously known to be learnable. 

4.1 Definitions 

In order to formally model this expert/apprentice relationship, we need to specify precisely 
what is to be learned by the apprentice, what information is available to the apprentice, and 
the types of questions the apprentice is permitted to ask. Here we assume that the expert's 
description of the world belongs to some class $\mathcal{F}$ of propositional CNF formulas. We assume that 
the apprentice knows $\mathcal{F}$; however, we assume that the apprentice does not know the expert's 
world description itself. The apprentice's goal is to learn the expert's target description. 

If we expect the apprentice to correct an erroneous hypothesis of the world, we must show 
him that his hypothesis is, in fact, in error. The means by which we allow the apprentice access 
to errors in his hypothesis is once again an equivalence query. 

To consult with the expert, we allow the apprentice subsumption queries. Recall that a CNF 
formula is a conjunction of disjunctions of literals and that for propositional clauses $C$ and $C'$, 
$C'$ subsumes the clause $C$ exactly when the set of literals in $C'$ is a subset of the literals in $C$. A 
subsumption query occurs when the apprentice chooses a clause and asks the expert whether 
the clause is subsumed by some clause of the target. The expert answers either "yes" or "no". 
Thus, subsumption queries can actually be viewed as membership queries using as examples 
clauses labeled according to subsumption by the expert's description of the world. Notice that 
the expert can answer the subsumption query in time polynomial in the product of the 
lengths of the apprentice's clause and the target. 
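
A subsumption query is easy to simulate when the target is known. The following Python sketch is only illustrative; the literal encoding and the names `subsumes` and `subsumption_query` are assumptions of this example, not notation from the text.

```python
from typing import FrozenSet, Iterable, Tuple

Literal = Tuple[str, bool]   # (variable, True if the variable is unnegated)
Clause = FrozenSet[Literal]


def subsumes(c_prime: Clause, c: Clause) -> bool:
    """C' subsumes C exactly when the literals of C' are a subset of those of C."""
    return c_prime <= c


def subsumption_query(target: Iterable[Clause], clause: Clause) -> bool:
    """Answer "yes" when some clause of the target subsumes the given clause."""
    return any(subsumes(t, clause) for t in target)


# Hypothetical target containing the single clause (not a or b).
target = [frozenset({("a", False), ("b", True)})]
print(subsumption_query(target, frozenset({("a", False), ("b", True), ("c", True)})))
print(subsumption_query(target, frozenset({("b", True), ("c", True)})))
```

Each query compares the apprentice's clause against every target clause once, which matches the polynomial bound noted above.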

To obtain polynomial time learnability for a class $\mathcal{F}$ of CNF formulas, we depend on $\mathcal{F}$ 
possessing three properties - the satisfiability problem for $\mathcal{F}$ is decidable in polynomial time, 
$\mathcal{F}$ is closed under variable projection, and $\mathcal{F}$ is closed under clause deletion. We now define 
these properties. 

Definition 48 Let $\mathcal{F}$ be a class of formulas, and let $\phi \in \mathcal{F}$. The satisfiability question for $\phi$ 
is "Does there exist a truth assignment for the variables appearing in $\phi$ that satisfies $\phi$?" The 
satisfiability problem for $\mathcal{F}$ is decidable in polynomial time if there exists an algorithm that 
correctly answers the satisfiability question for any $\phi \in \mathcal{F}$ and runs in time polynomial in the 
length of $\phi$. 

Definition 49 Let $\mathcal{F}$ be a class of formulas. We say that $\mathcal{F}$ is closed under variable projection 
if for any $\phi \in \mathcal{F}$ and any variable $v$ appearing in $\phi$, fixing the truth value of $v$ to be either T 
or F produces a formula $\phi' \in \mathcal{F}$ such that the size of $\phi'$ is no greater than the size of $\phi$. 

Definition 50 Let $\mathcal{F}$ be a class of CNF formulas. We say that $\mathcal{F}$ is closed under clause deletion 
if deleting any clause from any $\phi \in \mathcal{F}$ produces a formula $\phi' \in \mathcal{F}$. 
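
As an illustration of the two closure properties, the Python sketch below projects a variable out of a small Horn sentence and deletes a clause, then checks that the result is still Horn and no larger. The representation and the helper names (`project`, `delete_clause`, `size`, `is_horn`) are assumptions of this example.

```python
from typing import FrozenSet, List, Tuple

# A clause is (negated variables, unnegated variables).
Clause = Tuple[FrozenSet[str], FrozenSet[str]]


def project(formula: List[Clause], var: str, value: bool) -> List[Clause]:
    """Fix var to value: drop satisfied clauses and delete falsified literals."""
    result = []
    for neg, pos in formula:
        satisfied = (var in neg and not value) or (var in pos and value)
        if not satisfied:
            result.append((neg - {var}, pos - {var}))
    return result


def delete_clause(formula: List[Clause], index: int) -> List[Clause]:
    return formula[:index] + formula[index + 1:]


def size(formula: List[Clause]) -> int:
    return sum(len(neg) + len(pos) for neg, pos in formula)


def is_horn(formula: List[Clause]) -> bool:
    return all(len(pos) <= 1 for _, pos in formula)


# Hypothetical Horn sentence: (a -> b) and (b and c -> d) and (not d).
phi = [
    (frozenset({"a"}), frozenset({"b"})),
    (frozenset({"b", "c"}), frozenset({"d"})),
    (frozenset({"d"}), frozenset()),
]
for value in (True, False):
    phi_projected = project(phi, "b", value)
    print(value, is_horn(phi_projected), size(phi_projected) <= size(phi))
print(is_horn(delete_clause(phi, 1)))
```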

We close this section with a word on notation. We represent clauses $C$ in an implicative 
form denoted $\alpha \rightarrow \beta$ where $\alpha$ is the conjunction of variables occurring negated in $C$ and $\beta$ is 
the disjunction of variables occurring unnegated in $C$. 

4.2 The Generic Algorithm 

This section proves the main result of this chapter, but first we need a general property relating 
entailed clauses to subsumption. 

Lemma 51 Let $\phi = C_1 \wedge C_2 \wedge \cdots \wedge C_m$ be any CNF formula, and let $\alpha \rightarrow \beta$ be any clause $C$ 
entailed by $\phi$. Then there exists some $C_i = \alpha_i \rightarrow \beta_i$ in $\phi$ such that $C_i$ subsumes $\alpha \rightarrow \beta_i$. 

Proof: Let $M$ be the variable assignment that sets every variable in $\alpha$ to T and all other 
variables to F; clearly, $M$ falsifies $C$. Since $\phi \models C$, it must be the case that $M$ also falsifies $\phi$, 
and therefore $M$ must falsify some clause $C_i = \alpha_i \rightarrow \beta_i$ of $\phi$. This means that every variable in 
$\alpha_i$ is set T and every variable in $\beta_i$ is set F. But then $\alpha$ must contain every variable in $\alpha_i$ and 
none of the variables in $\beta_i$, so $\alpha_i \rightarrow \beta_i$ is the clause whose existence is asserted by the statement 
of the lemma. □ 
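
The proof is constructive, and the witness clause can be found by building the assignment $M$ directly. The Python sketch below is a hedged illustration; the function name `witness` and the clause encoding are assumptions of this example.

```python
from typing import FrozenSet, List, Optional, Tuple

# A clause alpha -> beta is stored as (alpha, beta).
Clause = Tuple[FrozenSet[str], FrozenSet[str]]


def witness(phi: List[Clause], alpha: FrozenSet[str]) -> Optional[Clause]:
    """Return a clause of phi falsified by M, where M sets exactly alpha to T.

    M falsifies alpha_i -> beta_i when every variable of alpha_i is T
    (alpha_i is contained in alpha) and every variable of beta_i is F
    (beta_i shares nothing with alpha). Returns None if M satisfies phi.
    """
    for alpha_i, beta_i in phi:
        if alpha_i <= alpha and not (beta_i & alpha):
            return (alpha_i, beta_i)
    return None


# Hypothetical example: phi = (a -> b) and (b -> c) entails the clause a -> c.
phi = [
    (frozenset({"a"}), frozenset({"b"})),
    (frozenset({"b"}), frozenset({"c"})),
]
print(witness(phi, frozenset({"a"})))
```

Here the witness is $a \rightarrow b$, which subsumes $\alpha \rightarrow \beta_i$ with $\alpha = \{a\}$ and $\beta_i = \{b\}$.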

Armed with the ability to ask subsumption queries, we now present a learning algorithm for 
any class $\mathcal{F}$ of formulas that is closed under variable projection, closed under clause deletion, 
and for which the satisfiability problem is polynomial time decidable. We have the following 
theorem. 

Theorem 52 Let $\mathcal{F}$ be any class of CNF formulas such that 

• $\mathcal{F}$ is closed under variable projection, 

• $\mathcal{F}$ is closed under clause deletion, and 

• the satisfiability problem for $\mathcal{F}$ is decidable in polynomial time. 

Then $\mathcal{F}$ is polynomial time learnable given equivalence and subsumption queries. 

Proof: Consider the algorithm Generic shown in Figure 4.1. Initially $h$ is the universally true 
hypothesis, thus $\phi \models h$. Every clause conjoined to $h$ in line 17 is subsumed by some clause of 
$\phi$, therefore $\phi \models h$ after every update. Thus, every counterexample obtained in line 2 must be 
a clause $C$ that $h$ improperly fails to entail. Thus, there must be a satisfying assignment for $h$ 
that falsifies $C$. This satisfying assignment is computed in the loop beginning in line 7. 

Now, in line 13, after the satisfying assignment has been computed, $T$ and $F$ contain the 
variables assigned T and F, respectively. Since this assignment falsifies $C$ and $\phi \models C$, this 
assignment falsifies $\phi$. Therefore $\phi \models (\bigwedge_{x \in T} x) \rightarrow (\bigvee_{y \in F} y)$. Now by Lemma 51 there is some 
$\beta'$ such that $(\bigwedge_{x \in T} x) \rightarrow \beta'$ is subsumed by some clause of $\phi$; certainly $\bigvee_{y \in F} y$ contains that 
$\beta'$, and so some clause of $\phi$ subsumes $(\bigwedge_{x \in T} x) \rightarrow (\bigvee_{y \in F} y)$. 

The loop beginning in line 13 removes variables from $F$ one at a time while ensuring that 
some clause of $\phi$ still subsumes the resulting clause. Thus this loop finds a minimal set $F$, i.e., 
finds exactly some $\beta'$ that is actually the $\beta_i$ for some clause $C_i$ of $\phi$. 

Similarly, the loop beginning in line 15 produces a minimal set $T$ such that $(\bigwedge_{x \in T} x) \rightarrow 
(\bigvee_{y \in F} y)$ is subsumed by some clause of $\phi$, i.e., $T$ will contain exactly the variables of $\alpha_i$ for 
some clause $C_i$ of $\phi$ whose $\beta_i$ is exactly contained in $F$. In other words, the clause added to $h$ 
in line 17 is some actual clause of $\phi$. To see that this is not the same as some clause already in 
$h$, notice that the assignment computed in the loop at line 7 satisfies every clause already in $h$, 
but this assignment falsifies the clause added to $h$ in line 17. 

In regard to running time, since every iteration of the while loop in line 2 produces a new 
clause of the target, there are no more iterations than there are clauses of the target. In the 
body of the loop, constructing the satisfying assignment for $h$ deserves consideration. Since $h$ 
contains a subset of the clauses of $\phi$, it must be that $h \in \mathcal{F}$ because $\mathcal{F}$ is closed under clause 
deletion. Similarly, provisionally setting a variable to T in line 8 produces a formula in $\mathcal{F}$ no 
larger than $h$ because $\mathcal{F}$ is closed under variable projection. Last, the test as to whether the 
variable can be set T in line 8 can be done in polynomial time because satisfiability for $\mathcal{F}$ is 
decidable in polynomial time. For the remainder of the while loop body, minimizing the sets 
$T$ and $F$ requires examining each variable only once. □ 
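
The steps argued in the proof can be sketched in Python against a simulated expert. This is a minimal illustration under stated assumptions rather than the thesis's implementation: satisfiability is decided by brute force as a stand-in for the polynomial time test, the learner is handed the full variable set and assigns every variable (so that $T \cup F$ covers all variables), and `make_teacher`, its `pad` flag, and all other names are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

Clause = Tuple[FrozenSet[str], FrozenSet[str]]  # (alpha, beta) meaning alpha -> beta
Formula = List[Clause]


def project(formula: Formula, var: str, value: bool) -> Formula:
    """Fix var to value: drop satisfied clauses, delete falsified literals."""
    out = []
    for alpha, beta in formula:
        if (var in alpha and not value) or (var in beta and value):
            continue
        out.append((alpha - {var}, beta - {var}))
    return out


def satisfiable(formula: Formula) -> bool:
    """Brute-force stand-in for the polynomial time satisfiability test."""
    if any(not alpha and not beta for alpha, beta in formula):
        return False  # the formula contains the empty clause
    variables = {v for alpha, beta in formula for v in alpha | beta}
    if not variables:
        return True
    v = min(variables)
    return satisfiable(project(formula, v, True)) or satisfiable(project(formula, v, False))


def entails(h: Formula, clause: Clause) -> bool:
    """h entails alpha -> beta iff h with alpha true and beta false is unsatisfiable."""
    alpha, beta = clause
    for v in alpha:
        h = project(h, v, True)
    for v in beta:
        h = project(h, v, False)
    return not satisfiable(h)


def generic(variables: Iterable[str],
            eq: Callable[[Formula], Optional[Clause]],
            subsume: Callable[[Clause], bool]) -> Formula:
    h: Formula = []
    while True:
        counterexample = eq(h)
        if counterexample is None:
            return h
        alpha, beta = counterexample
        T, F = set(alpha), set(beta)
        h1 = h
        for v in T:
            h1 = project(h1, v, True)
        for v in F:
            h1 = project(h1, v, False)
        # Extend to an assignment that satisfies h and falsifies the counterexample.
        for v in sorted(set(variables) - T - F):
            trial = project(h1, v, True)
            if satisfiable(trial):
                h1 = trial
                T.add(v)
            else:
                h1 = project(h1, v, False)
                F.add(v)
        for v in sorted(F):  # remove extraneous variables from the consequent
            if subsume((frozenset(T), frozenset(F - {v}))):
                F.remove(v)
        for v in sorted(T):  # remove extraneous variables from the antecedent
            if subsume((frozenset(T - {v}), frozenset(F))):
                T.remove(v)
        h = h + [(frozenset(T), frozenset(F))]


def make_teacher(target: Formula, variables: Iterable[str], pad: bool = True):
    """Simulated expert; with pad=True counterexamples carry extra antecedent variables."""
    def subsume(clause: Clause) -> bool:
        alpha, beta = clause
        return any(a <= alpha and b <= beta for a, b in target)

    def eq(h: Formula) -> Optional[Clause]:
        for alpha, beta in target:
            if not entails(h, (alpha, beta)):
                padded = (frozenset(variables) - beta, beta)
                if pad and not entails(h, padded):
                    return padded
                return (alpha, beta)
        return None

    return eq, subsume


# Hypothetical target: (a -> b or c) and (b -> d) and (not c or not d).
target = [
    (frozenset({"a"}), frozenset({"b", "c"})),
    (frozenset({"b"}), frozenset({"d"})),
    (frozenset({"c", "d"}), frozenset()),
]
eq, subsume = make_teacher(target, "abcd")
learned = generic("abcd", eq, subsume)
print(learned == target)
```

Every clause added to the hypothesis is an actual clause of the target, so the number of equivalence queries is at most one more than the number of target clauses.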

4.2.1 A Note on Disposing of Equivalence Queries 

As mentioned earlier, it is possible (and perhaps appealing) to replace the equivalence queries 
with randomly generated examples. If our apprentice simply observes nature and verifies that 
his hypothesis correctly predicts$^{1}$ each observation, then any misprediction can be treated as a 
counterexample to what would have been an equivalence query. Assuming the expert's 
description of the world is accurate, the apprentice will make only a number of mistakes polynomial 
in the size of the expert's description. Thus the apprentice can acquire the expert's efficient 
description of the world without demanding that the expert answer any equivalence queries, or 
indeed, without demanding that the expert even be able to articulate his description. 

1 Because $\mathcal{F}$ is closed under variable projection and the satisfiability problem for $\mathcal{F}$ is decidable in polynomial 
time, prediction of an observation, i.e., entailment by the hypothesis, is efficiently decidable for the apprentice. 

 1  Set $h$ to be the empty hypothesis 
 2  While EQ($h$) returns a counterexample $\alpha \rightarrow \beta$ 
 3      Set $T$ to be the set of variables in $\alpha$ 
 4      Set $F$ to be the set of variables in $\beta$ 
 5      Let $h'$ be $h$ with each variable in $T$ set to T and each variable in $F$ set to F 
 6      Let $V$ be the set of unassigned variables in $h'$ 

        /* Find a satisfying assignment for $h'$ */ 
 7      For each $v \in V$ 
 8          If $h'$ is satisfiable when $v$ is T 
 9              Set $v$ to T, and project $v$ out of $h'$ 
10              Place $v$ in $T$ 
            else 
11              Set $v$ to F, and project $v$ out of $h'$ 
12              Place $v$ in $F$ 

        /* Remove extraneous variables from the consequent */ 
13      For each variable $v$ in $F$ 
14          If SUBSUME($(\bigwedge_{x \in T} x) \rightarrow (\bigvee_{y \in F - \{v\}} y)$), remove $v$ from $F$ 

        /* Remove extraneous variables from the antecedent */ 
15      For each variable $v$ in $T$ 
16          If SUBSUME($(\bigwedge_{x \in T - \{v\}} x) \rightarrow (\bigvee_{y \in F} y)$), remove $v$ from $T$ 

        /* Add the new target clause to the hypothesis */ 
17      $h := h \wedge ((\bigwedge_{x \in T} x) \rightarrow (\bigvee_{y \in F} y))$ 

Figure 4.1: The Generic learning algorithm. 

4.3 A Newly Learnable Class 

Boros et al. [27] define a class of CNF formulas for which the satisfiability problem is decidable 
in polynomial time. This class, which we call BCH, is defined next. 

Definition 53 Let $V$ be a set of propositional variables. Partition $V$ into two disjoint sets $X$ and 
$Y$. The class BCH is the set of arbitrary conjunctions of clauses over $V$ such that 

• No clause contains more than two unnegated variables, 

• No clause contains more than two Y variables, and 

• No clause contains both a Y variable and an unnegated X variable. 

This class contains functions that are representable neither as Horn sentences nor as 2-CNF 
formulas$^{2}$. For example, taking $X = \{a\}$ and $Y = \{b, c\}$ shows that the formula $(\neg a \vee b \vee c)$ is 
in BCH, although it is representable neither as a Horn sentence nor as a 2-CNF formula. 

Taking $X = V$ shows that BCH contains Horn sentences. The first condition in the definition 
is satisfied because Horn sentences permit at most one unnegated variable per clause; the last 
two conditions are trivially satisfied because there are no $Y$ variables at all. 

Taking $Y = V$ shows that BCH contains 2-CNF formulas. The first two conditions of the 
definition are satisfied because no clause of a 2-CNF formula has more than two variables; the 
last condition is satisfied because there are no $X$ variables. 
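
The three conditions of the definition translate directly into a membership test. The Python sketch below checks them for the examples discussed above; the function name `in_bch` and the clause encoding are assumptions of this illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A clause is (negated variables, unnegated variables).
Clause = Tuple[FrozenSet[str], FrozenSet[str]]


def in_bch(formula: List[Clause], X: Set[str], Y: Set[str]) -> bool:
    """Check the three BCH conditions for a partition of the variables into X and Y."""
    for neg, pos in formula:
        y_vars = (neg | pos) & Y
        if len(pos) > 2:          # more than two unnegated variables
            return False
        if len(y_vars) > 2:       # more than two Y variables
            return False
        if y_vars and (pos & X):  # a Y variable together with an unnegated X variable
            return False
    return True


mixed = [(frozenset({"a"}), frozenset({"b", "c"}))]             # (not a or b or c)
horn = [(frozenset({"a", "b"}), frozenset({"c"})), (frozenset({"c"}), frozenset())]
two_cnf = [(frozenset(), frozenset({"a", "b"})), (frozenset({"a"}), frozenset({"c"}))]

print(in_bch(mixed, X={"a"}, Y={"b", "c"}))
print(in_bch(horn, X={"a", "b", "c"}, Y=set()))
print(in_bch(two_cnf, X=set(), Y={"a", "b", "c"}))
print(in_bch([(frozenset(), frozenset({"a", "b"}))], X={"a"}, Y={"b"}))
```

The last call fails the third condition, since the clause pairs the $Y$ variable b with the unnegated $X$ variable a.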

Thus, BCH properly contains the union of Horn sentences and 2-CNF formulas. Although 
this class was not previously known to be learnable, we have the following result. 

Corollary 54 The class BCH is exactly learnable (using time polynomial in the size of the 
expert's description) from subsumption and equivalence queries. 

Proof: Boros et al. showed that the satisfiability problem for BCH is decidable in polynomial 
time. Next, observe that BCH is closed under variable projection and clause deletion. Thus by 
Theorem 52, Generic is a polynomial time learning algorithm for BCH. □ 

2 The class 2-CNF consists of conjunctions of clauses, each clause containing at most two literals. 

4.4 Queries About Proof Length 

As a final note we consider the notion that an expert may be able to deduce that some fact 
follows from his description of the world by constructing a simple proof quickly, but beyond some 
point the complexity of constructing a proof that a particular fact follows from his description 
exceeds his abilities. We model this provable/unprovable distinction by the length of the proof; we 
assume that the expert can construct a proof of length $k$ almost instantaneously. In such a situation, 
the expert's "yes" response actually means that the clause presented by the apprentice is subsumed 
by some clause whose proof is at most $k$ resolutions long. We call this a proof-length $k$ query. 

By adding the property of being closed under resolution to the hypothesis of Theorem 52 
we easily obtain the following corollary. 

Corollary 55 Let $k$ be fixed. Let $\mathcal{F}$ be any class of CNF formulae such that 

• $\mathcal{F}$ is closed under resolution, 

• $\mathcal{F}$ is closed under variable projection, 

• $\mathcal{F}$ is closed under clause deletion, and 

• the satisfiability problem for $\mathcal{F}$ is decidable in polynomial time. 

Then given equivalence and proof-length $k$ queries, $\mathcal{F}$ is exactly learnable in time polynomial in 
the size of the expert's description and exponential in $k$. 

Proof: (Sketch) Since $\mathcal{F}$ is closed under resolution, assume that all resolvents of proofs of 
length at most $k$ are added to the target and apply Theorem 52 to this padded target. □ 
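
The padded target used in the proof sketch can be computed explicitly for small examples. The Python sketch below is only an illustration under an assumption of its own: proof length is measured as the number of resolution steps in a tree-shaped derivation, which may differ from the measure intended in the text. The names `resolvents`, `pad_target`, and `proof_length_query` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Literal = Tuple[str, bool]   # (variable, True if the variable is unnegated)
Clause = FrozenSet[Literal]


def resolvents(c1: Clause, c2: Clause) -> List[Clause]:
    """All non-tautological resolvents of c1 and c2."""
    out = []
    for var, polarity in c1:
        if (var, not polarity) in c2:
            r = (c1 - {(var, polarity)}) | (c2 - {(var, not polarity)})
            if not any((u, not p) in r for u, p in r):
                out.append(frozenset(r))
    return out


def pad_target(target: Iterable[Clause], k: int) -> Dict[Clause, int]:
    """Map each clause derivable in at most k resolution steps to its step count."""
    cost: Dict[Clause, int] = {frozenset(c): 0 for c in target}
    changed = True
    while changed:
        changed = False
        items = list(cost.items())
        for c1, n1 in items:
            for c2, n2 in items:
                steps = n1 + n2 + 1  # tree-shaped proof: both premises plus one step
                if steps > k:
                    continue
                for r in resolvents(c1, c2):
                    if steps < cost.get(r, k + 1):
                        cost[r] = steps
                        changed = True
    return cost


def proof_length_query(padded: Iterable[Clause], clause: Clause) -> bool:
    """Answer "yes" when some clause of the padded target subsumes the given clause."""
    return any(c <= clause for c in padded)


# Hypothetical target: the chain (a -> b), (b -> c), (c -> d).
chain = [
    frozenset({("a", False), ("b", True)}),
    frozenset({("b", False), ("c", True)}),
    frozenset({("c", False), ("d", True)}),
]
a_implies_d = frozenset({("a", False), ("d", True)})
print(proof_length_query(pad_target(chain, 1), a_implies_d))
print(proof_length_query(pad_target(chain, 2), a_implies_d))
```

In this chain, deriving a → d takes two resolution steps, so the query is answered "no" for k = 1 and "yes" for k = 2.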

4.5 Discussion 

This chapter provides a sufficient set of conditions under which a polynomial time learning 
algorithm using equivalence and subsumption queries is guaranteed to exist. Which, if any, of 
these conditions is also necessary? 

Theorem 52 can be viewed in different ways. Strictly speaking from the view of 
computational learning theory, the subsumption queries used by Generic cheat: they explicitly request 
information about the representation of the target rather than merely information about the 
function represented by the target. An entailment query - "Is this clause entailed by the 
target?" - is a non-cheating query related to subsumption. Is there a simple set of sufficient 
conditions under which a theorem analogous to Theorem 52 holds where entailment rather 
than subsumption queries are used? 

On the other hand, questions of entailment may demand lengthy proofs from the expert, 
whereas, operationally speaking, answers to questions of subsumption would manifest 
themselves as the amount of time it takes the expert to answer an entailment question. Thus from 
a practical standpoint, any delay from the expert in answering "Is this clause true?" can be 
interpreted as an answer of "no" to the question of "Is this clause subsumed?" Thus, by taking 
advantage of an expert caching an efficient description (as opposed to a minimal description) 
the apprentice acquires an accurate and efficient description from the expert - even when the 
expert is unable to articulate that description. Along this line of discussion, an obvious 
question to investigate is psychological: do experts cache efficient descriptions in representation 
languages amenable to acquisition by Generic? 

Chapter 5 

Consistently Ignorant Teachers: 
Membership by Consensus 

Most of the theoretical work models the interaction between the learner and the environment 
by an omniscient oracle (or teacher) that classifies all objects as positive or negative examples 
of the concept to be learned. Thus, it is assumed that there is a well-defined border separating 
positive examples from negative ones. In practice, though, classification is often unclear. For 
example, suppose a robot operating in an assembly plant must determine whether a part is 
defective. While some parts will be clearly defective and some clearly not defective, there may 
be some parts for which the teacher cannot decide. As another example, there are situations 
in which the classification of some objects is not well defined: An algorithm designed to read 
handwritten cheques will likely encounter many handwritten characters that look somewhat 
like a "4", and somewhat like a "9". In such cases, where even an expert does not have the 
knowledge to classify all objects, determining which objects are unclassifiable seems at least 
as important as determining the classifications of objects which are classifiable. From the 
learner's perspective, the regions of the example space that defy classification create a blurry 
border between the positive and negative examples that the learner must determine. 

In this chapter, we introduce a new formal learning model in which the teacher (or 
environment) with which the learner interacts has incomplete information about the target function 
due to intrinsic uncertainty or due to gaps in the teacher's knowledge. The key requirement 
we place on the teacher is that all examples (or objects) labeled with "?" (indicating unknown 
classification) are consistent with the teacher's background knowledge about the class to which 
the unknown function belongs. In particular, the classification of any example labeled with 
"?" should not be determinable from the positive and negative examples, and knowledge of 
the concept class. (Thus the teacher is "consistently ignorant".) Observe that the examples 
labeled with "?" can be arbitrarily far away from the original border; they are just required 
to be consistent with the other labels and the background knowledge of the class. The goal of 
the learner will be to learn a good approximation to the knowledge of the teacher. Namely, the 
learner must construct a ternary function (i.e., with values {0,1,?}) that, with high probability, 
classifies most randomly drawn examples exactly as the teacher does. 

Next we introduce the notion of an agreement of concepts from a concept class C, and 
show that the problem of learning concepts $f \in C$ from a consistently ignorant teacher can be
modeled as the problem of learning agreements of concepts from C. As a third characterization, 
we show that any blurry concept can be represented as the agreement of a finite union and 
intersection of concepts from C. We then show that for any concept class C for which PAC 
learning algorithms are known, these algorithms can be used to build an algorithm for learning 
the agreement of nested concepts from C. For the problem of learning the agreement of concepts
from C that are not necessarily nested, we show that if the intersection and union of arbitrarily 
many concepts from C is learnable, then C is learnable from a consistently ignorant teacher. 
While, often, algorithms are not known for learning unions and intersections of concepts from 
C, under certain conditions it is still possible to learn the agreements of concepts from C. For
example, consider a class C for which the intersection of concepts from C is learnable, yet there 
is no known algorithm to learn the union of concepts from C. In some cases it may still be 
possible to learn C from a consistently ignorant teacher by using information gained by learning 
the intersection of concepts in the agreement to aid in learning the union of these concepts. The 
learner's ability to use intersection (union) information to obtain positive results for learning 
unions (intersections) of concepts from classes for which no algorithms are currently known 
is intriguing. To illustrate the limits of this approach, we show that learning the agreement 
of an arbitrary number of Horn sentences is as hard as learning DNF. Thus, assuming DNF 
cannot be learned in the standard model (permitting membership queries), propositional Horn
sentences, while learnable in standard models from omniscient teachers, cannot be learned from
consistently ignorant teachers.

5.1 Background and Related Work

Most previous research on concept learning assumes for any $f \in C$ and $x \in X$ that either
$f(x) = 1$ or $f(x) = 0$. In these situations the border between the positive and negative
examples is well defined. There has been work addressing the issue of mislabeled training 
examples [14, 69, 97, 64] and some addressing the issue of noise in the attributes [94, 51, 74]. In 
these situations, the border between the positive and negative examples may appear blurry to 
the learner, but this is just the result of the noise process that has been applied to the properly 
labeled example. There has also been some work considering learning from noisy membership 
queries [50, 91]. 

Angluin and Slonim [12] introduced a model of incomplete membership queries in which 
each membership query is answered "don't know" with a given probability. Furthermore, this 
information is persistent — repeatedly making a query that was answered with "don't know" 
always results in a "don't know" answer. As in their work, one of our goals is to model the 
situation in which the teacher responding to the learner's queries is not omniscient. Observe
that in Angluin and Slonim's model, since the teacher is randomly fallible, there is no guarantee
that all of the teacher's knowledge about the target concept is used in answering queries. For
example, it is possible that their teacher knows that poodles are mammals, but responds with
"don't know" when asked if a French poodle is a mammal.¹ Further, their result for learning
monotone DNF depends very heavily on this inconsistency of the teacher. In the context of 
monotone DNF, our consistency requirement manifests itself as follows: The teacher should 
know that adding positive attributes to an already positive example yields a positive example. 
(Dually for negative examples.) Thus, in the standard boolean lattice defined over variable 
assignments, all positive examples are above all unknown examples, which, in turn, are above 
all negative examples. In Angluin and Slonim's algorithm for learning monotone DNF, if the 
teacher replies that $f(x) = ?$ then the learner samples below x in the boolean lattice for some
(known) positive example y, implying that x is a positive example. If none are found, the learner
concludes with high probability that x is a negative example. Thus, the teacher's ignorance
is not consistent with the knowledge that the target function is monotone; the learner can
determine the underlying boolean function by deducing what the teacher does not (but should)
know. In our model, this would not be possible, as the teacher's lack of knowledge is consistent;
the best that the learner can do (and what we demand that the learner do) is to learn which
examples are positive, which negative, and which are unknown.

¹ In our view, the notion of an incomplete membership oracle seems to better model noise than it models
incomplete knowledge. Indeed, they note that their algorithm for learning monotone DNF with an incomplete
membership oracle can be used to learn monotone DNF with random one-sided errors.

In other related work, Kearns and Schapire [67] generalized the PAC setting to non-binary
values using Haussler's framework [56]. They define a p-concept in which each example $x \in X$
has some probability $p(x)$ of being classified as positive. An observation consists of an example
x drawn randomly according to D and then independently classified as positive with probability
$p(x)$ and negative with probability $1 - p(x)$. In their model, the goal of the learner is to make
optimal predictions, or more commonly, to accurately predict $p(x)$ for all $x \in X$. The goal of
the learner in our proposed model has similarities with the p-concepts model. However, here we
are interested in learning problems for which the learner need just determine whether $p(x) = 0$,
$p(x) = 1$, or $0 < p(x) < 1$. (If a written numeral is sometimes identified as "4" and sometimes as
"9", the learner just wants to know this — it does not need to determine what percentage of the 
population would call the numeral each value.) Similarly, our work differs from the literature 
on fuzzy sets in that we do not quantify degrees of membership. (It differs perhaps even more 
in our application of the learning models from computational learning theory.) 

5.2 The Model of a Consistently Ignorant Teacher 

We now formally define our model of learning from a consistently ignorant teacher. A blurry 
ternary concept $f_?$ is created by taking any $f$ from the base class C and changing a set of
examples $Q \subseteq X$ from their current value to "?", indicating that the teacher does not know
their classifications. Furthermore, we require that this be done consistent with the knowledge
that $f$ was chosen from C: if every concept $f \in C$ consistent with the labels of examples from
$X - Q$ labels q as positive (respectively, negative), then $f_?$ cannot label q as "?". More formally:

Definition 56 Let $f_? : X \to \{0, 1, ?\}$, and let $P = \{x \mid f_?(x) = 1\}$, $N = \{x \mid f_?(x) = 0\}$, and
$Q = \{x \mid f_?(x) = ?\}$. Then $f_?$ is a blurry concept for C if for every $q \in Q$, there exist functions
$f_0$ and $f_1$ in C such that: (1) for all $x \in P$, $f_0(x) = f_1(x) = 1$, (2) for all $x \in N$,
$f_0(x) = f_1(x) = 0$, and (3) $f_0(q) = 0 \neq 1 = f_1(q)$. We define the blurry concept class
$C_? = \{f_? \mid f_? \text{ is a blurry concept for } C\}$.

Thus for any concept class C, the class $C_?$ contains exactly those blurry concepts that can
be generated from some $f \in C$. For a target function f, we say that an example $x \in X$ is a
positive example if $f(x) = 1$, is an unknown example if $f(x) = ?$, and is a negative example if
$f(x) = 0$. We assume that random examples are chosen (by nature) from an unknown, arbitrary
distribution D, and are then given a label from $\{0, 1, ?\}$ by the teacher, and presented to the
learner. Using the obvious extension of the PAC and PAC with membership query models we
say that the learner has successfully learned $f_? \in C_?$ if with probability at least $1 - \delta$, the
(ternary) hypothesis output by the learner has probability at most $\epsilon$ of disagreeing with $f_?$ on
a randomly drawn example from D. If such a polynomial-time learning algorithm exists, we
say that the blurry class $C_?$ is learnable, or equivalently, that the class C is learnable from a
consistently ignorant teacher, in the PAC or PAC with membership query models. Finally, note
that one way a hypothesis h might err is if $f_?(x) = ?$ and $h(x) \neq ?$. Thus, "?" does not mean
"don't care".

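For a finite example space, the consistency condition of Definition 56 and the ternary error measure can be checked directly. The Python sketch below is only illustrative: the names `is_blurry_concept` and `ternary_error`, the dictionary encoding of labels, and the toy class of three monomials over two variables are assumptions made here, not part of the model.

```python
UNKNOWN = "?"

def is_blurry_concept(labels, concept_class):
    """Check Definition 56 on a finite domain.

    labels maps each example to 0, 1, or "?"; concept_class is a list of
    functions from examples to {0, 1} (an assumed encoding of the class C).
    """
    positives = [x for x, v in labels.items() if v == 1]
    negatives = [x for x, v in labels.items() if v == 0]
    unknowns = [x for x, v in labels.items() if v == UNKNOWN]
    # Concepts of C that agree with every known (0/1) label.
    consistent = [f for f in concept_class
                  if all(f(x) == 1 for x in positives)
                  and all(f(x) == 0 for x in negatives)]
    # Every "?" example needs one consistent concept labeling it 0 and
    # another labeling it 1 (the functions f_0 and f_1 of the definition).
    return all(any(f(q) == 0 for f in consistent) and
               any(f(q) == 1 for f in consistent)
               for q in unknowns)

def ternary_error(target, hypothesis, weights):
    """Probability mass (under weights) on which two ternary labelings differ."""
    return sum(w for x, w in weights.items() if target[x] != hypothesis[x])

# Toy base class: the monomials x1, x2, and x1 AND x2 over two variables.
monomials = [lambda x: x[0], lambda x: x[1], lambda x: x[0] & x[1]]

consistent_teacher = {(1, 1): 1, (0, 0): 0, (0, 1): UNKNOWN, (1, 0): UNKNOWN}
# Here the known labels force the target to be x2, so a consistent teacher
# would have to know that (1, 1) is positive.
inconsistent_teacher = {(1, 1): UNKNOWN, (0, 0): 0, (0, 1): 1, (1, 0): 0}

uniform = {x: 0.25 for x in consistent_teacher}
guess = dict(consistent_teacher)
guess[(0, 1)] = 0   # answering 0 where the teacher says "?" counts as an error
```

In this toy setting the first labeling satisfies the definition and the second does not, and the hypothesis `guess` is charged for the one example on which it commits to a label where the teacher answers "?".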
5.3 An Alternate Formulation of The Model

To understand some complexity issues involved in learning from consistently ignorant teachers, 
we consider when C is the class of pure conjunctive concepts (monomials) — each concept is 
a simple conjunction of variables or their negations. Let P, Q, and N be the set of positive, 
unknown, and negative examples, respectively, for some blurry monomial. In this case, it is 
straightforward to show that P must be representable as a (nonblurry) monomial m. Further, 
it is not difficult to show that $P \cup Q$ can be represented by a unate² DNF that contains only
those literals appearing in m (provided P is not empty). These observations are sufficient to
construct a PAC with membership query algorithm to learn the class of blurry monomials (for
which P is nonempty): run a known algorithm for learning (non-blurry) monomials [98] to
learn the set P of positive examples, and at the same time run a known learning algorithm for
unate DNF [9] to learn the set $P \cup Q$ of nonnegative examples. Then Q and N can be easily
determined from knowledge of P and $P \cup Q$.

² A unate formula is one in which no variable appears both negated and unnegated.

Is this a polynomial time algorithm? It depends on our choice of complexity parameters.
By definition, there is some "underlying" boolean monomial m (of size $|m| \le n$, the number
of variables) that has been turned into a blurry concept. So perhaps $|m|$ is a reasonable size
measure for this blurry concept, and the above algorithm needs to run in time polynomial
in $|m|$ to be considered efficient. However, as we observed, the learning problem is not that
of determining some underlying boolean concept, but that of determining the ternary blurry
concept, which requires learning N, and thus indirectly, $P \cup Q = X - N$. A particularly
nasty choice of "?" examples can result in a unate DNF describing the set $P \cup Q$ that has
a number of terms exponential in n (hence, exponential in the size of any monomial from C).
Instead of relying, then, on the size of some underlying concept, we capture the complexity of
a blurry concept by reformulating the notion as that of an agreement of base concepts, and let
the complexity rely on the complexity of the individual base concepts forming the agreement.
Let F be a finite set of boolean functions. The function $\mathrm{Agree}_F$ is a ternary function whose
classification on example $x \in X$ is given by

\[
\mathrm{Agree}_F(x) =
\begin{cases}
1 & \text{if } f(x) = 1 \text{ for each } f \in F, \\
0 & \text{if } f(x) = 0 \text{ for each } f \in F, \\
? & \text{otherwise.}
\end{cases}
\]

We now argue that the problem of learning agreements of concepts from C is equivalent to
learning C from a consistently ignorant teacher, or equivalently, learning the blurry class $C_?$.
The notion of an agreement of base concepts has independent interest, as it models a type of
unanimous vote of independent agents. The following lemma shows that being consistently
ignorant is no different than being indecisive in the face of competing hypotheses.

Lemma 57 For any class C of boolean concepts, the blurry class $C_? = \{\mathrm{Agree}_F \mid F \subseteq C\}$.

Proof: We first show that any $f_? \in C_?$ can be expressed as an agreement of concepts from C.
Let $P = \{x \mid f_?(x) = 1\}$, $Q = \{x \mid f_?(x) = ?\}$, and $N = \{x \mid f_?(x) = 0\}$. We construct a set
of concepts F from $f_?$ such that for all $x \in X$, $f_?(x) = \mathrm{Agree}_F(x)$. Initially, let $F = \emptyset$. Now
for each $x \in X$ for which $f_?(x) = ?$, we add to F the pair of functions $f_0$ and $f_1$ described in
Definition 56. By definition, $f_0$ and $f_1$ exist for each $x \in Q$. Since all concepts placed in F
correctly classify all examples in P, it follows that for any x such that $f_?(x) = 1$, $\mathrm{Agree}_F(x) = 1$.
Likewise, for all x such that $f_?(x) = 0$, $\mathrm{Agree}_F(x) = 0$. Finally, observe that for any x such that
$f_?(x) = ?$ we have placed in F two concepts ($f_0$ and $f_1$) that disagree on the classification of x
and thus $\mathrm{Agree}_F(x) = ?$.

We now show that for any $F \subseteq C$ there is a blurry concept $f_? \in C_?$ that is logically equivalent
to $\mathrm{Agree}_F$. Select any $f \in F$ as the target function from which $f_?$ will be created. Observe
that by the definition of $\mathrm{Agree}_F$, for all examples x that are classified by $\mathrm{Agree}_F$ as positive
or negative, $\mathrm{Agree}_F(x) = f(x)$. Let $Q \subseteq X$ be the examples classified as "?" by $\mathrm{Agree}_F$.
Since for any $x \in Q$ there must be two concepts in F that classify all examples in $X - Q$
correctly but classify x differently, it follows that for each $q \in Q$, $f(q)$ cannot be computed
from $\{(x, f(x)) \mid x \in X - Q\}$. Thus the blurry concept $f_?$ obtained from f by changing the
examples in Q to "?" is logically equivalent to $\mathrm{Agree}_F$. □

Hence, the problem of learning blurry concepts $C_?$ generated from a base class C is equivalent
to the problem of learning the agreements of sets of concepts from the base class C (and
equivalently, learning C from a consistently ignorant teacher). Using this correspondence, we
obtain a complexity measure for the size of $f_?$: First, define the representation size of a subset of
concepts F to be $\sum_{f \in F} |f|$. Now define the size of $f_? \in C_?$ (denoted by $|f_?|$) to be the minimum,
over all subsets $F \subseteq C$ for which $\mathrm{Agree}_F = f_?$, of the representation size of F.
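The agreement function and the first direction of Lemma 57 can be made concrete on a small finite domain. The following Python sketch is illustrative only: `agree` and `witnesses` are hypothetical names, concepts are assumed to be encoded as 0/1-valued functions, and `witnesses` assumes its input really is a blurry concept (otherwise no witness pair exists and the search fails).

```python
UNKNOWN = "?"

def agree(concepts, x):
    """Agree_F(x) for a non-empty finite collection of 0/1-valued functions."""
    values = {f(x) for f in concepts}
    if not values:
        raise ValueError("F must be non-empty")
    if values == {1}:
        return 1
    if values == {0}:
        return 0
    return UNKNOWN

def witnesses(labels, concept_class):
    """Collect, for each "?" example, a pair (f_0, f_1) as in Definition 56."""
    known = [(x, v) for x, v in labels.items() if v != UNKNOWN]
    # Concepts consistent with all positive and negative labels.
    consistent = [f for f in concept_class
                  if all(f(x) == v for x, v in known)]
    chosen = []
    for q, v in labels.items():
        if v == UNKNOWN:
            f0 = next(f for f in consistent if f(q) == 0)
            f1 = next(f for f in consistent if f(q) == 1)
            chosen.extend([f0, f1])
    return chosen

# Toy base class: the monomials x1, x2, and x1 AND x2 over two variables.
monomials = [lambda x: x[0], lambda x: x[1], lambda x: x[0] & x[1]]
blurry = {(1, 1): 1, (0, 0): 0, (0, 1): UNKNOWN, (1, 0): UNKNOWN}

# The collected witnesses form a set F whose agreement reproduces the labels.
F = witnesses(blurry, monomials)
```

On this example the agreement of the collected witnesses labels every point exactly as the blurry concept does, which is the construction used in the first half of the proof.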

5.4 Positive Results for Learning Agreements

We show that PAC and PAC with membership query learning algorithms can be designed 
to learn from consistently ignorant teachers. We first consider the problem of learning the 
agreement of a pair of nested concepts. We show that if both concepts are chosen from classes 
for which learning algorithms exist, then we can use these algorithms to obtain an algorithm 
for learning the agreement of the functions. We then present a general result addressing how 
known algorithms for learning from omniscient teachers can be applied to learn from consistently 
ignorant teachers even when the base functions are not nested. We apply this technique to 
obtain positive results for learning (under certain conditions) the agreement of monomials, 
monotone DNF formulas, and DNF formulas with a constant number of terms.

5.4.1 Learning Agreements of Nested Concepts

Observe that a concept $f \in C$ can be viewed as a characteristic function denoting the subset
of examples from X that f classifies as positive. Thus for two concepts $f_1$ and $f_2$, we write
$f_1 \subseteq f_2$ if the set of positive examples of $f_1$ is a subset of the positive examples of $f_2$. Given a set of
concepts $F = \{f_1, \ldots, f_r\}$ we say that these concepts are nested if $f_1 \subseteq f_2 \subseteq \cdots \subseteq f_r$. Observe
that in such cases $\mathrm{Agree}_F = \mathrm{Agree}_{\{f_1, f_r\}}$ and thus, without loss of generality, we consider
learning the agreement, $\mathrm{Agree}_{\{f_s, f_g\}}$, of two nested functions $f_s$ and $f_g$ (s and g for "specific"
and "general"). Suppose these are chosen, respectively, from known polynomial-time PAC with
membership query learnable concept classes $C_S$ and $C_G$. Then the learning algorithms for $C_S$
and $C_G$ can be used to learn the following blurry concept class:

\[ \mathrm{Nested}_?(C_S, C_G) = \{\mathrm{Agree}_{\{f_s, f_g\}} \mid f_s \in C_S,\ f_g \in C_G,\ \text{and } f_s \subseteq f_g\}. \]

Theorem 58 If $C_S$ and $C_G$ are PAC with membership query (respectively PAC) learnable concept
classes, then $\mathrm{Nested}_?(C_S, C_G)$ is PAC with membership query (respectively PAC) learnable.

Proof: Let $A_S$ (respectively $A_G$) be the PAC with membership query algorithm for learning
$C_S$ (respectively $C_G$). We obtain an algorithm A (Figure 5.1) to learn $\mathrm{Nested}_?(C_S, C_G)$ by
simultaneously running $A_S$ treating "?" as "-" to obtain $h_S$, and running $A_G$ treating "?" as
"+" to obtain $h_G$, demanding error at most $\epsilon/2$ with confidence at least $1 - \delta/2$ from each.
Then output $h = \mathrm{Agree}_{\{h_S, h_G\}}$ as the final hypothesis.

We now argue that the hypothesis h output by A has error at most $\epsilon$ with probability at
least $1 - \delta$. By the correctness of $A_S$, the probability that the error of $h_S$ is greater than $\epsilon/2$ is
less than $\delta/2$. Likewise, the probability that the error of $h_G$ is greater than $\epsilon/2$ is less than $\delta/2$.
Clearly the error of h is at most the error of $h_S$ plus the error of $h_G$, and thus the probability
that the error of h is at most $\epsilon$ is at least $1 - \delta$. Since $A_S$ and $A_G$ run in polynomial time,
it immediately follows that A runs in polynomial time. Finally, note that A only makes a
membership query when either $A_S$ or $A_G$ does. □
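The relabeling at the heart of this reduction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: membership queries are omitted (only a labeled sample is used), the name `learn_nested_agreement` and the learner signature `(sample, eps, delta)` are hypothetical, and the toy base learner `learn_threshold` for initial segments of the integers stands in for the algorithms $A_S$ and $A_G$.

```python
UNKNOWN = "?"

def learn_nested_agreement(learn_s, learn_g, sample, eps, delta):
    """Learn Agree_{f_s, f_g} from ternary-labeled examples.

    learn_s and learn_g are assumed to be learners for the specific and
    general classes; each receives binary labels and half of eps and delta.
    """
    # The learner for the specific concept sees "?" as a negative label.
    sample_s = [(x, 1 if y == 1 else 0) for x, y in sample]
    # The learner for the general concept sees "?" as a positive label.
    sample_g = [(x, 0 if y == 0 else 1) for x, y in sample]
    h_s = learn_s(sample_s, eps / 2, delta / 2)
    h_g = learn_g(sample_g, eps / 2, delta / 2)

    def h(x):
        a, b = h_s(x), h_g(x)
        return a if a == b else UNKNOWN   # the agreement of h_s and h_g
    return h

def learn_threshold(sample, eps, delta):
    """Toy consistent learner for concepts of the form 'x <= t' (ignores eps, delta)."""
    t = max((x for x, y in sample if y == 1), default=-1)
    return lambda x: 1 if x <= t else 0

# Hypothetical target: f_s is 'x <= 3' and f_g is 'x <= 6', so 4..6 are unknown.
def target(x):
    return 1 if x <= 3 else (UNKNOWN if x <= 6 else 0)

sample = [(x, target(x)) for x in range(10)]
h = learn_nested_agreement(learn_threshold, learn_threshold, sample, 0.1, 0.1)
```

Because each base learner is asked for accuracy $\epsilon/2$ and confidence $1 - \delta/2$, the combined hypothesis inherits the $\epsilon$ and $\delta$ guarantees argued in the proof.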

We note that, under the assumption that the evaluation problems for $C_S$ and $C_G$ are
decidable in polynomial time, the stronger exact learning version of Theorem 58 can be obtained:
a "?" example returned from an equivalence query is given to $A_S$ when it would serve as a
negative counterexample to $h_S$; otherwise it must be a positive counterexample for $h_G$. Positive
and negative examples returned from an equivalence query are handled similarly.

Learn-Agreement-Nested-Concepts(f, ε, δ)

    Let F := {f_s, f_g} such that f_s ⊆ f_g.
    Let A_S be the PAC with membership query learning algorithm for C_S.
    Let A_G be the PAC with membership query learning algorithm for C_G.
    Simulate A_S (with parameters ε/2 and δ/2) as follows:
        If A_S requests an example, then draw a random labeled example (x, f(x)) from D.
        If A_S performs a membership query member(x), then perform a membership
        query on x to obtain f(x).
        If f(x) = 1 then give (x, 1) to A_S,
        Else give (x, 0) to A_S.
    Let h_S be the hypothesis output by A_S.
    Simulate A_G (with parameters ε/2 and δ/2) as follows:
        If A_G requests an example, then draw a random labeled example (x, f(x)) from D.
        If A_G performs a membership query member(x), then perform a membership
        query on x to obtain f(x).
        If f(x) = 1 or f(x) = "?" then give (x, 1) to A_G,
        Else give (x, 0) to A_G.
    Let h_G be the hypothesis output by A_G.
    Return the hypothesis Agree_{h_S, h_G}.

Figure 5.1: A method for learning the agreement of nested concepts.

Another modification of the above algorithm produces an on-line learning form of Theorem 58.
More specifically, the modified algorithm receives an arbitrary sequence of examples
x, and is asked to predict $f_?(x)$. The algorithm may ask membership queries (about examples
$y \neq x$). Then it predicts the value of $f_?(x)$. The modified algorithm spends only polynomial
time to output each next prediction, and the number of times it will predict incorrectly,
regardless of the sequence of trials, is bounded by a polynomial in n, $|f_s|$ and $|f_g|$.

5.4.2 A General Technique for Learning Agreements

We now give a general technique that allows Theorem 58 to be applied to arbitrary blurry 
concepts. We show how an arbitrary agreement of concepts from a class C (and hence, an 
arbitrary blurry concept $f_?$ from $C_?$), can be represented, without significant increase in size, as
the agreement of two nested concepts, one of which is an intersection of concepts from C, and
the other is a union of concepts from C.³ Thus when unions and intersections of concepts from
C are learnable, then the blurry class $C_?$ will be learnable. We then apply this technique to
show that, with some restrictions, the class of monomials, monotone DNF formulas, and DNF 
formulas with a constant number of terms are learnable from a consistently ignorant teacher. 

Definition 59 Let F be a finite set of boolean functions. The function $\mathrm{Union}_F$ is a boolean
function which classifies example x as "1" if $f(x) = 1$ for some $f \in F$ and as "0" otherwise.
Likewise, $\mathrm{Intersect}_F$ classifies x as "1" if $f(x) = 1$ for each $f \in F$ and as "0" otherwise.

To discuss algorithms for learning unions or intersections of concepts from a given class,
we must provide a size measure for each concept in the class. As we defined $|\mathrm{Agree}_F|$, we
define $|\mathrm{Union}_F|$ (respectively, $|\mathrm{Intersect}_F|$) to be the minimum, taken over all $F' \subseteq C$ for which
$\mathrm{Union}_{F'} = \mathrm{Union}_F$ (respectively, $\mathrm{Intersect}_{F'} = \mathrm{Intersect}_F$), of the representation size of $F'$.
Each concept $\mathrm{Agree}_F$ in the class $C_?$ is equivalent to the agreement of two concepts, $\mathrm{Union}_F$
and $\mathrm{Intersect}_F$, and the size of this alternate representation is at most twice the size of the
original representation.

³ There is an interesting relationship, discussed in Section 5.6, between this observation and Mitchell's version
space algorithm [80].

Lemma 60 If F is a finite subset of concept class C, then $\mathrm{Agree}_F = \mathrm{Agree}_{\{\mathrm{Intersect}_F, \mathrm{Union}_F\}}$,
and the size of $\mathrm{Agree}_{\{\mathrm{Intersect}_F, \mathrm{Union}_F\}}$ is at most twice the size of $\mathrm{Agree}_F$.

Proof: The assertion about size holds by definition, as for both functions the determining 
value is the size of the set F, that appears once for one function, and twice for the other. 

The functions are equivalent: note that for any example x, $\mathrm{Agree}_{\{\mathrm{Intersect}_F, \mathrm{Union}_F\}}(x) = 1$
if and only if $\mathrm{Intersect}_F(x) = 1$, since $\mathrm{Intersect}_F(x) \le \mathrm{Union}_F(x)$. It therefore follows that
$\mathrm{Agree}_{\{\mathrm{Intersect}_F, \mathrm{Union}_F\}}(x) = 1$ if and only if for all $f \in F$, $f(x) = 1$. But, by definition,
this is exactly when $\mathrm{Agree}_F(x) = 1$. An analogous argument shows that the two functions
are identical when x is a negative example. Finally, since they agree as to which examples are 
classified as positive and negative, they must also agree as to which examples are to be classified 
"?". □ 

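The identity in Lemma 60 is easy to check exhaustively on a small domain. The Python sketch below does so for three monomials over three boolean variables; the helper names `agree`, `intersect`, and `union` and the particular monomials are choices made for this illustration only.

```python
from itertools import product

UNKNOWN = "?"

def agree(concepts, x):
    """Agree_F(x) for a non-empty finite collection of 0/1-valued functions."""
    values = {f(x) for f in concepts}
    if values == {1}:
        return 1
    if values == {0}:
        return 0
    return UNKNOWN

def intersect(concepts):
    """Intersect_F: 1 exactly when every concept in F outputs 1."""
    return lambda x: int(all(f(x) == 1 for f in concepts))

def union(concepts):
    """Union_F: 1 exactly when some concept in F outputs 1."""
    return lambda x: int(any(f(x) == 1 for f in concepts))

# Hypothetical F: the monomials x1 AND x2, x2, and x2 AND x3.
F = [lambda x: x[0] & x[1], lambda x: x[1], lambda x: x[1] & x[2]]
pair = [intersect(F), union(F)]

domain = list(product([0, 1], repeat=3))
# Agree_F and Agree_{Intersect_F, Union_F} label every point identically.
same = all(agree(F, x) == agree(pair, x) for x in domain)
```

The two-element representation is what allows the nested-concept result of Theorem 58 to be reused, since $\mathrm{Intersect}_F \subseteq \mathrm{Union}_F$ always holds.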
We now use this characterization to obtain an algorithm for learning from a consistently
ignorant teacher when finite sets of unions and intersections from the given class are known
to be learnable. To aid the exposition, we introduce $C_\cap$ to denote the set
$\{\mathrm{Intersect}_F : F \text{ a finite subset of } C\}$, and $C_\cup$ to denote $\{\mathrm{Union}_F : F \text{ a finite subset of } C\}$.

Theorem 61 Let C be a concept class for which $C_\cap$ and $C_\cup$ are PAC with membership query
(respectively PAC) learnable. Then $C_?$ is PAC with membership query (respectively PAC) learnable.

Proof: For any collection F of concepts, $\mathrm{Intersect}_F \subseteq \mathrm{Union}_F$, and thus by Theorem 58, if
$C_\cap$ and $C_\cup$ are PAC with membership query learnable then so is
$\{\mathrm{Agree}_{\{\mathrm{Intersect}_F, \mathrm{Union}_F\}} : F \subseteq C\}$. Combined with Lemma 60 we get the desired result. As in Theorem 58 this result
also applies when no membership queries are provided to the learner. □

The following corollary shows that the agreement of monomials with nonempty intersection 
is learnable. 

Corollary 62 Let C be the class of monomials, and let $C_?^{+} = \{c : c \in C_?,\ \exists x\ c(x) = 1\}$. Then
$C_?^{+}$ is PAC with membership query learnable.

Proof: The class $C_\cap$ is learnable since C is closed under intersection and known to be learnable [98].
If $F \subseteq C$ is a subset for which there is some example x such that $\mathrm{Intersect}_F(x) = 1$,
then x satisfies every monomial in F and so it cannot be the case that some variable appears
both negated and unnegated in F. Thus $\mathrm{Union}_F$ is a unate DNF formula, which is PAC with
membership query learnable [9]. It is also easily verified that for any finite $F \subseteq C$, the size of
the representation of $\mathrm{Intersect}_F$ as a monomial and the size of the representation of $\mathrm{Union}_F$ as
a unate DNF formula are each $O(\sum_{f \in F} |f|)$. Thus by Theorem 61, $C_?^{+}$ is learnable. □

Note that in the above corollary, had we not thrown out those blurry concepts of C? for 
which there were no positive examples, our proof would fail because it may not be the case that 
$\mathrm{Union}_F$ is unate. Thus, we would face the task of learning the class of arbitrary DNF formulas.
Also note that the corollary applies to the dual of monomials (i.e. 1-DNF) when we discard 
blurry concepts that have no negative examples. 

We next consider agreements, unions, and intersections of at most a constant number k of
concepts from a class C. Let $C_?(k) = \{\mathrm{Agree}_F : F \subseteq C, |F| \le k\}$,
$C_\cap(k) = \{\mathrm{Intersect}_F : F \subseteq C, |F| \le k\}$, and $C_\cup(k) = \{\mathrm{Union}_F : F \subseteq C, |F| \le k\}$. Applying the known learning results for
monotone DNF [98] and DNF formulas with a constant number of terms [1, 18, 21] and the
known learning results for decision trees [29] and DFAs [3], we obtain the following corollary.
(The corollary follows because the intersection and union of a constant number of concepts
from each of the preceding classes can be represented by a single concept in the corresponding
class that is at most polynomially larger.)

Corollary 63 Let C be the class of monotone DNF (t-term DNF, decision trees, DFAs) formulas.
Then for each constant k, $C_?(k)$ is PAC with membership query learnable. The dual results
for monotone CNF formulas and t-clause CNF formulas also hold. In the case of decision trees,
the hypothesis space is conjunctions of unate DNF.

5.4.3 Learning Unions of Boxes in Euclidean Space

We now turn our attention to geometric concepts. In this section we give an algorithm to learn 
the agreement of a set of s axis-parallel boxes in d-dimensional Euclidean space ($E^d$) when the
set of boxes has a samplable intersection. Throughout the remainder of this section, unless
otherwise specified, by a box we mean an axis-parallel box in $E^d$. It is easy to show that this
class is a generalization of unate disjunctive normal form formulas, and a specialization of the
class of unions of boxes in $E^d$.

Previous algorithms that PAC learn an s-fold union of boxes in $E^d$ include that of Long and
Warmuth [75], which runs in time polynomial in d only, and that of Blumer et al. [23], which runs
in time polynomial in s only. There has also been work on learning unions of s boxes in the discretized
space $\{1, \ldots, n\}^d$. Most of this work has focused on the special case in which d = 2. Chen
and Maass [32] gave an algorithm to learn the union of two axis-parallel rectangles in
the discretized space $\{1, \ldots, n\} \times \{1, \ldots, m\}$ in time polynomial in log n and log m, where one
rectangle has a corner in the top left corner and the other has a corner in the bottom right
corner. While learning the union of these two rectangles within these time bounds was difficult,
learning the agreement of the rectangles is quite simple since the learner needs only learn the
intersection of the two rectangles, which is easily achieved.

Chen [30] gave an algorithm that uses $O(\log^2 n)$ equivalence queries to learn the union
of two rectangles in the discretized plane (i.e. $\{1, \ldots, n\}^2$). Also, Chen and Homer [31] have
given an algorithm to learn the union of s rectangles in the discretized plane using $O(s^3 \log n)$
membership and equivalence queries and $O(s^5 \log n)$ time. More recently, Goldberg, Goldman
and Mathias [49] have given an algorithm to learn the union of s discretized boxes in $\{1, \ldots, n\}^d$
that makes at most $sd + 1$ equivalence queries and uses $O((8s)^d + sd \lg n)$ time and membership
queries. Note that their algorithm, like the PAC algorithm of Long and Warmuth [75], only
runs in polynomial time for d constant.

The algorithm we present here is a PAC with membership query learning algorithm for the 
agreement of s boxes in $E^d$ that runs in time polynomial in $1/\epsilon$, $1/\delta$, s, and $9^d$. Thus, the
algorithm runs in polynomial time without demanding that one of s and d be constant (e.g.
d can be $O(\log s)$). A key algorithm we use in learning the agreement of boxes is a PAC with
membership query learning algorithm for the union of a set of boxes that all lie in the same
quadrant of $E^d$, and for which the intersection region contains the origin. We call each box in
such a set an origin-incident box because each such box touches the coordinate axes in such
a way that the box contains the origin as a corner. Our algorithm to learn the union of s
origin-incident boxes runs in time polynomial in both d and s.

To aid in learning the agreement of boxes, we also use the known algorithm for computing 
the intersection of boxes [23]. Namely, we first learn an approximation for the intersection 
region by applying the standard algorithm with all "?" examples treated as negative. Since the
boxes have a non-empty intersection, we can subdivide $E^d$ into at most $3^d$ sub-regions based
on this common intersection. Each sub-region can be translated and relabeled so that we can
apply our algorithm for learning the union of origin-incident boxes. In the worst case, some
piece of each of the s boxes will lie in each of the $3^d$ regions of the sub-divided problem, forcing
us to learn $O(s 3^d)$ boxes.
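One way to picture the subdivision is to index each point by where it falls, coordinate by coordinate, relative to the learned intersection box. The Python sketch below is only an illustration of why there are at most $3^d$ sub-regions: the name `region_of` and the representation of the intersection box by its lower and upper corners are assumptions, and the translation and relabeling of each sub-region are not shown.

```python
from itertools import product

def region_of(x, lo, hi):
    """Index of the sub-region containing x, relative to the box [lo, hi].

    Each coordinate contributes -1 (below the box), 0 (within its extent),
    or +1 (above), giving at most 3**d distinct indices.
    """
    return tuple(-1 if xi < l else (1 if xi > h else 0)
                 for xi, l, h in zip(x, lo, hi))

# Hypothetical intersection box in d = 2 dimensions.
lo, hi = (1.0, 1.0), (3.0, 3.0)

# One probe point per sub-region: 3 choices per coordinate, 9 regions in all.
grid = product([0.0, 2.0, 4.0], repeat=2)
regions = {region_of(p, lo, hi) for p in grid}
```

The index with all coordinates equal to 0 corresponds to the intersection region itself; the remaining indices identify the pieces to which the origin-incident algorithm would be applied.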

It is important to note that in obtaining our algorithm to learn the agreement of boxes we 
take advantage of our ability to efficiently compute the intersection region and then use this 
information to aid in more efficiently learning the union of the boxes. It is uncommon for both 
intersections and unions of concepts to be learnable, and thus, the possibility that information 
from one of these could be used to learn the other is of particular interest. 

5.4.3.1 Learning the Union of Origin-incident Boxes 

We present an algorithm to learn the union of s origin-incident boxes in $E^d$ where all of the
boxes are in the same quadrant (for simplicity we only present the algorithm for the positive
quadrant). We refer to the class of origin-incident boxes in the positive quadrant as BPQ. We
define the upper corner of a box $b \in$ BPQ to be the corner of the box diametrically opposed
to the origin. Since any box in BPQ is uniquely identified by its upper corner, we denote an
origin-incident box by box(p) where p is its upper corner. Finally, we define maxCorner to be a
function that takes a set of points in the positive quadrant of $E^d$ and returns the upper corner
of the smallest box in BPQ that contains every point in the set.
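These two notions translate directly into code. In the Python sketch below, points are assumed to be tuples of non-negative numbers, and the names `max_corner` and `in_box` are hypothetical stand-ins for maxCorner and for membership in box(p).

```python
def max_corner(points):
    """Upper corner of the smallest origin-incident box containing all points.

    Assumes a non-empty collection of equal-length tuples in the positive quadrant.
    """
    return tuple(max(coords) for coords in zip(*points))

def in_box(upper, x):
    """True when x lies in box(upper), the box spanning the origin and upper."""
    return all(0 <= xi <= ui for xi, ui in zip(x, upper))

corner = max_corner([(1, 4), (3, 2)])
```

The upper corner is simply the coordinate-wise maximum, which is why a box in BPQ is determined by a single point.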

In the PAC model, an important contribution in characterizing what concept classes are 
learnable was made by Blumer et al. [23]. A finite set $S \subseteq X$ is shattered by C if for each
subset $S' \subseteq S$, there is a concept $f \in C$ that contains all of $S'$ and none of $S - S'$. The
Vapnik-Chervonenkis dimension of C, denoted vcd(C), is defined to be the largest d for which some set of
d points is shattered by C. Building on the work of Vapnik and Chervonenkis [101], Blumer et al.
proved that any PAC learning algorithm must draw at least $\Omega\left(\frac{1}{\epsilon}\ln\frac{1}{\delta} + \frac{\mathrm{vcd}(C)}{\epsilon}\right)$ examples.
Furthermore, they proved that the general technique of finding a hypothesis consistent with
a set of $O\left(\frac{1}{\epsilon}\ln\frac{1}{\delta} + \frac{\mathrm{vcd}(C)}{\epsilon}\ln\frac{1}{\epsilon}\right)$ examples, when feasible, always results in a (possibly
super-polynomial time) PAC learning algorithm.⁴ We use this result to show that $\mathrm{BPQ}_\cup(s)$ is PAC
with membership query learnable.



⁴ There are some technical restrictions on the concept classes for which this result applies.

LearnBPQ(S)
    /* S is a labeled sample. */
    /* This algorithm will, with probability at least 1 - δ, output a hypothesis with error
       at most ε given that |S| ≥ max((4/ε) log(2/δ), (8 vcd/ε) log(13/ε)). */
    h := ∅   /* The set of boxes in the hypothesis; represented as upper corners */
    P := {x | x ∈ S, x is a positive example}
    while there exists an example x ∈ P
        P := P - {x}
        for each y ∈ P if member(maxCorner{x, y}) = "yes" then
            x := maxCorner{x, y}
            P := P - {y}
        add box(x) to h
    return h   /* That is, output the union of boxes in h */



Figure 5.2: Algorithm to learn a union of origin-incident boxes. 
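
The loop in Figure 5.2 can be sketched in Python as follows. This is only an illustrative sketch, not the thesis's implementation: the names `max_corner`, `in_box`, and `learn_union_of_boxes` are hypothetical, the membership oracle is passed in as a plain callable, and removing the selected point x from P before the inner loop is an assumption made here so that the loop terminates.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def max_corner(points: Sequence[Point]) -> Point:
    """Upper corner of the smallest origin-incident box containing all points."""
    return tuple(max(coords) for coords in zip(*points))


def in_box(corner: Point, x: Point) -> bool:
    """True if x lies in the origin-incident box whose upper corner is `corner`."""
    return all(0 <= xi <= ci for xi, ci in zip(x, corner))


def learn_union_of_boxes(positives: Sequence[Point],
                         member: Callable[[Point], bool]) -> List[Point]:
    """Greedy sketch of Figure 5.2; returns upper corners of hypothesis boxes."""
    remaining = list(positives)
    hypothesis: List[Point] = []
    while remaining:
        x = remaining.pop()          # assumption: the selected point leaves P here
        kept = []
        for y in remaining:
            candidate = max_corner([x, y])
            if member(candidate):    # membership query on the merged corner
                x = candidate        # grow the current box; y is now covered
            else:
                kept.append(y)       # y must belong to a different target box
        remaining = kept
        hypothesis.append(x)
    return hypothesis
```

Each pass over the remaining positive points issues one membership query per point, which matches the O(m) cost per iteration used in the running-time argument.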



Theorem 64 Let $\mathrm{BPQ}_{\cup}(s)$ be the union of at most s origin-incident boxes. The class $\mathrm{BPQ}_{\cup}(s)$
is PAC with membership query learnable with time and sample complexity polynomial in s, d,
$1/\epsilon$, $1/\delta$.

Proof: To prove the theorem, we show that

1. Algorithm LearnBOQ (Figure 5.2) takes as input a sample S, runs in time polynomial in
the size of S, and outputs a union of at most s origin-incident boxes (that is, an element of $\mathrm{BPQ}_{\cup}(s)$)
that is consistent with the sample.

2. The VC-dimension of $\mathrm{BPQ}_{\cup}(s)$ grows polynomially with s and d (in particular, it is at
most $2ds\log 3s$).

It then follows from Theorem 2.1 of Blumer et al. [23] that if LearnBOQ is given a sample of
cardinality at least $m = \max\left(\frac{4}{\epsilon}\log\frac{2}{\delta}, \frac{16ds\log(3s)}{\epsilon}\log\frac{13}{\epsilon}\right)$ then with probability at least $1 - \delta$ it
will output a hypothesis h with error at most $\epsilon$.

To see that (2) is true, we first show that the VC-dimension of BPQ is at most d. A point can
be uniquely excluded from a set of points by a box in BPQ only if at least one of its
components is the largest among all points in the set for that component; since there are only d
components, the VC-dimension can be at most d.$^5$ Then by Lemma 3.2.3 of Blumer et al. [23],
the VC-dimension of $\mathrm{BPQ}_{\cup}(s)$ is at most $2ds\log(3s)$. To complete the proof, it remains to be
shown that (1) holds. We first show that LearnBOQ produces a hypothesis that is consistent
with the sample S. The hypothesis produced is consistent with the positive examples of S since
the algorithm does not terminate until all positive examples of S have been removed from P
and no point is removed unless the box about to be placed in h contains it. Furthermore, if
box(x) was placed in h, then x was a positive example (either it was in P or verified to be
positive with a membership query). Since x is a positive example, box(x) is contained within
some box of the target. Thus no negative points (even those not in S) can be contained in any
of the boxes placed in h.

We now prove that the hypothesis h output by LearnBOQ contains at most s boxes. Suppose
h contained more than s boxes. Since each box of h is contained within a box of the target,
it follows that there must be at least two boxes (say $b_i$ and $b_j$) in h that are contained within
the same box (say $b^*_k$) of the target. Assume, without loss of generality, that $b_i$ was placed in
h first. Let $p_i$ be the point from P selected in step 3 during the iteration of the while loop in
which $b_i$ was added to h. Thus $p_i$ must be contained within $b_i$. Likewise, let $p_j$ be the point
from P selected during the iteration of the while loop in which $b_j$ was added to h. (So $p_j$ is in
$b_j$.) Since $p_j \in P$ after $b_i$ was placed in h, a membership query must have been performed on
maxCorner$\{p'_i, p_j\}$, where box($p'_i$) contains $p_i$. Furthermore, since $p_j$ was not removed during
the construction of $b_i$, it follows that maxCorner$\{p'_i, p_j\}$ is a negative example. Since $p_i$ is
contained within box($p'_i$) it must be that maxCorner$\{p_i, p_j\}$ is also a negative example. Recall
that the box $b^*_k$ of the target contains $b_i$ and $b_j$ and thus $b^*_k$ contains $p_i$ and $p_j$. However, this
contradicts the fact that maxCorner$\{p_i, p_j\}$ is a negative example. Thus h contains at most s
boxes.

Finally, note that LearnBOQ runs in polynomial time, since there are at most s iterations 
of the while loop, each of which takes at most O(m) time, where m is the cardinality of S. This 
completes the proof of (1) above, and hence of the theorem. 

□ 

$^5$ In fact, the VC-dimension is exactly d because any subset S of the set of points
$\{p_i : p_i = (x_1, \ldots, x_d),\ x_i = 3,\ x_{j \ne i} = 1\}$ can be selectively excluded from the set by the box
$b \in \mathrm{BPQ}$ having upper corner $(z_1, \ldots, z_d)$ where $z_{i \in S} = 2$ and $z_{i \notin S} = 4$.
We note that LearnBOQ is easily modified to obtain an algorithm to learn the union of s
origin-incident boxes in $E^d$ when all of the boxes are in any single quadrant.

5.4.3.2 Learning the Agreement of Boxes with Samplable Intersection 

In this section we give an algorithm to learn the agreement of s boxes in $E^d$ (hence, an algorithm
to learn boxes from a consistently ignorant teacher) when the intersection region is samplable.
Our algorithm has polynomial time and sample complexity in both d and s when $d = O(\log s)$.
The intuition behind our algorithm lies in the way in which the non-empty intersection of a
set of boxes can be used to partition $E^d$ into $3^d$ sub-regions. Let B be the set of boxes for
which we are computing the agreement. Figure 5.3 illustrates the effect of this partitioning on
a typical box $b \in B$. The large, transparent box is b, and the solidly shaded box $b_I$ in the
center is the intersection of all boxes in B (and thus must be contained in b). By infinitely
extending the faces of $b_I$ we decompose b into a set of sub-boxes that are also axis-parallel. A
few of the sub-boxes generated by this extension have been cross-hatched. Notice that some of
the sub-boxes generated in this manner share a face with $b_I$, others share only an edge, and
still others share only a corner. Including $b_I$ itself, the decomposition of $E^3$ results in $3^3 = 27$
sub-regions. In general, there are $3^d$ sub-regions in $E^d$ as seen informally by first observing that
the bounds of the intersection region are d pairs of parallel hyperplanes, one pair of parallel
hyperplanes for each dimension. Thus, in each dimension the sub-region lies either above both
of the hyperplanes, lies between the pair, or lies beneath both of the hyperplanes. Therefore,
we have 3 choices for each set of bounding planes giving a total of $3^d$ sub-regions.
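
As a small illustration of this counting argument, the sketch below labels a point with one of three codes per dimension (below, between, or above the pair of bounding hyperplanes). The function names and the encoding as -1, 0, +1 are hypothetical choices made for this example, not notation from the thesis.

```python
from itertools import product
from typing import List, Sequence, Tuple


def sub_region(p: Sequence[float],
               lower: Sequence[float],
               upper: Sequence[float]) -> Tuple[int, ...]:
    """Per dimension: -1 if below both hyperplanes, 0 if between, +1 if above."""
    return tuple(-1 if x < lo else (1 if x > hi else 0)
                 for x, lo, hi in zip(p, lower, upper))


def all_sub_regions(d: int) -> List[Tuple[int, ...]]:
    """Every possible code; the all-zero code is the intersection region itself."""
    return list(product((-1, 0, 1), repeat=d))
```

Under this encoding, the number of zero entries in a code is the dimension of the boundary that the sub-region shares with the intersection region.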

Sub-regions and Sub-boxes. A useful way to categorize these $3^d$ sub-regions is by the
dimension of the boundaries they share with the intersection region. For example, in Figure 5.3
a sub-region that shares a face with the intersection region shares a 2-dimensional boundary. A
sub-region that shares an edge shares a 1-dimensional boundary, and a sub-region that shares
only a corner shares a 0-dimensional boundary.

Since, by definition, the intersection region is contained in every box in B, the manner
in which these boxes can overlap in a sub-region is restricted based on the dimension of the
boundary that the sub-region shares with the intersection region. Figure 5.4 illustrates the
constraints imposed by the dimension of the shared boundary: the higher the dimension of the
shared boundary, the greater the number of dimensions that are constrained by the intersection.
From the figure we see that in sub-regions that share only a corner point with the intersection
region, we have almost no information about the way the sub-boxes in that sub-region might
be arranged. At the other extreme, in the sub-regions that share a face with the intersection
region, we know almost precisely how the sub-boxes are arranged; the only unknown is how
far beyond the face the sub-boxes extend.

Figure 5.3: Decomposition of an axis-parallel box with respect to the intersection region.

To eliminate the dimensions of a sub-region that are already constrained by the intersection
region we introduce the following notation. Let $p = (x_1, x_2, \ldots, x_d)$ be a point in $E^d$, and
let I be a set of indices $\{i_1, i_2, \ldots, i_k\}$ such that $1 \le i_1 < i_2 < \cdots < i_k \le d$. Then the
point $\pi_I(p) = (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ in $E^k$ is the projection of p with respect to I. In general, if a
sub-region shares a k-dimensional boundary with a d-dimensional intersection region, then for
any sub-box in that sub-region we need only determine the sub-box's extent in the remaining
$d - k$ dimensions. Equivalently, boxes in the same sub-region can be translated to all be
origin-incident boxes in a $(d - k)$-dimensional space for which we can apply LearnBOQ. Furthermore
these $d - k$ dimensions are exactly those which are not shared with the intersection region.
Let the intersection region of the boxes in B be $\{(x_1, x_2, \ldots, x_d) : w_i \le x_i \le z_i\}$ for constants
$w_i$ and $z_i$ ($1 \le i \le d$). Now for any point $p = (a_1, a_2, \ldots, a_d) \in E^d$, if $w_i \le \pi_{\{i\}}(p) \le z_i$ and
p is contained in some box $b \in B$ then for any point y such that $w_i \le y \le z_i$, the point
$(a_1, a_2, \ldots, a_{i-1}, y, a_{i+1}, \ldots, a_d)$ is also contained in b. Thus for any point p we can ignore any
dimension i of p if p lies in a sub-region that is bounded between the pair of parallel hyperplanes
that bound dimension i of the intersection region. This observation is used by our algorithm
to take a collection of points for a specific sub-region, project out any dimension for which the
sub-region is bounded, and then learn all the sub-boxes in that sub-region using a version of
LearnBOQ of suitable dimension for the resulting projected and translated points.

Figure 5.4: Sub-region constraints imposed by the dimension of the boundary shared with the
intersection region.
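
The projection and translation step can be sketched as below. This is a hedged illustration only: indices are 0-based here (the text uses 1-based indices), `project` and `to_origin_incident` are hypothetical names, and measuring each free coordinate as its distance beyond the nearest face of the intersection region (which also reflects coordinates lying below the region) is an assumption about how the translated points are placed in a positive quadrant.

```python
from typing import Sequence, Tuple


def project(p: Sequence[float], indices: Sequence[int]) -> Tuple[float, ...]:
    """The projection pi_I(p): keep only the coordinates listed in `indices`."""
    return tuple(p[i] for i in indices)


def to_origin_incident(p: Sequence[float],
                       lower: Sequence[float],
                       upper: Sequence[float]) -> Tuple[float, ...]:
    """Drop dimensions bounded by the intersection region and express each
    remaining coordinate as a non-negative offset from the shared boundary."""
    out = []
    for x, lo, hi in zip(p, lower, upper):
        if x < lo:
            out.append(lo - x)   # below the region: reflect to a positive offset
        elif x > hi:
            out.append(x - hi)   # above the region: offset from the upper face
        # lo <= x <= hi: dimension is constrained by the intersection; ignore it
    return tuple(out)
```

The resulting points live in a space whose dimension equals the number of unconstrained coordinates, which is the input expected by a learner for origin-incident boxes.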

Removing Intersection Box Estimation Error. There remains a subtle point that we
must address. So far we have assumed that we know the intersection region exactly.
However, in reality, we apply a known PAC algorithm [23] to obtain a good approximation of the
intersection region; the approximation box is contained in the intersection region. To obtain
an approximation with error at most $\epsilon$ with probability at least $1 - \delta$, this algorithm draws
a sample of size $\max\left(\frac{4}{\epsilon}\log\frac{2}{\delta}, \frac{16d}{\epsilon}\log\frac{13}{\epsilon}\right)$ and returns the smallest box that is consistent with
the sample. Let IBox(S) be a procedure that takes a sample S and returns the smallest box
consistent with S. To apply this algorithm to learn the intersection region of the boxes in our
model, we simply modify the sample by changing all "?" examples to negative examples.
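
A minimal sketch of such a procedure is given below, assuming the sample is a list of (point, label) pairs in which "?" labels have already been changed to 0. The name `ibox` and the data layout are illustrative assumptions; the smallest consistent box is taken to be the bounding box of the positive points.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def ibox(sample: Sequence[Tuple[Point, int]]) -> Optional[Tuple[Point, Point]]:
    """Return (lower corner, upper corner) of the smallest axis-parallel box
    containing every positively labeled point, or None if there are none."""
    positives: List[Point] = [p for p, label in sample if label == 1]
    if not positives:
        return None
    lower = tuple(min(coords) for coords in zip(*positives))
    upper = tuple(max(coords) for coords in zip(*positives))
    return lower, upper
```

Because the box is spanned by positive points only, it is always contained in the true intersection region, which is the underestimate property used in the text.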

The difficulty presented here is that the sub-region in which a point p lies may differ when
subdividing based on the true intersection region versus subdividing based on the underestimate
for the intersection region. Figure 5.5 illustrates how this may happen where $A^*$ is the true
intersection region and A is our underestimate of $A^*$; the point marked "?" lies between the
vertical boundaries for $A^*$, but lies to the right of the vertical boundaries for A. We handle
this difficulty by discretizing $E^d$ with an irregular Cartesian grid.

Figure 5.5: An example assigned to the wrong sub-region.

Suppose we have a collection S of points from $E^d$. For each dimension i consider the set
$S_i = \{\pi_{\{i\}}(p) : p \in S\}$. Notice that $S_i$ is a collection of points from $E^1$ and that if we consider
labeling the coordinate axis for dimension i of $E^d$ using only values found in $S_i$, then we will
have effectively discretized $E^d$ in such a way that every point of the sample S lies at some
intersection point of the resulting irregular Cartesian grid. We then expand our estimate A
of the true intersection region $A^*$ in such a way that for every point p in S, the sub-region
generated by A in which p lies and the sub-region generated by $A^*$ in which p lies are the same.
An algorithm to achieve this goal is given in Figure 5.6. We have the following lemma.

Lemma 65 Let A be a non-empty underestimate of the true intersection region $A^*$. The
algorithm Expand(A, S) outputs a box $A'$ so that for all $p \in S$, the sub-region generated by $A'$
in which p lies and the sub-region generated by $A^*$ in which p lies are the same. Furthermore,
Expand runs in time polynomial in the size of S.

Proof Sketch: Consider the infinitely tall, finitely wide strip in Figure 5.5 bounded between
the right edges of A and $A^*$ in which the point "?" lies. If this strip were samplable, Expand
would have been likely to witness some point in the strip. Any point in the strip, when projected
between the top and bottom edges of A, would have been a positive example. Thus the right
edge of A would have been closer to the right edge of $A^*$. □
Expand(A, S)

/* A is a non-empty underestimate of the true intersection region. */
/* S is a set of labeled example points. */

1   Let $(a_1, a_2, \ldots, a_d)$ be any point contained in A.
2   For i := 1 to d
3       /* Use membership queries to obtain a lower bound for dimension i. */
        Set $l_i$ to be the smallest v such that $v = \pi_{\{i\}}(p)$ for some $p \in S$ and
        $(a_1, a_2, \ldots, a_{i-1}, v, a_{i+1}, \ldots, a_d)$ is a positive example.
4       /* Use membership queries to obtain an upper bound for dimension i. */
        Set $u_i$ to be the largest v such that $v = \pi_{\{i\}}(p)$ for some $p \in S$ and
        $(a_1, a_2, \ldots, a_{i-1}, v, a_{i+1}, \ldots, a_d)$ is a positive example.
5   Return the box $A'$ having opposing corners $(l_1, l_2, \ldots, l_d)$ and $(u_1, u_2, \ldots, u_d)$.

Figure 5.6: Algorithm to expand an underestimate A of the true intersection region $A^*$
to an estimate $A'$ such that for any point p in S, the sub-region generated by $A'$ in which p lies
is the same as the sub-region generated by $A^*$ in which p lies.
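
The procedure of Figure 5.6 can be sketched in Python as follows. Several details are assumptions made for this illustration rather than statements from the thesis: boxes are (lower corner, upper corner) pairs, the membership oracle returns 1 exactly on the true intersection region, the lower corner of A is used as the fixed point $(a_1, \ldots, a_d)$, and the bounds of A itself are added to the candidate values so that each dimension always has at least one positive query.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]


def expand(box: Box,
           sample_points: Sequence[Point],
           member: Callable[[Point], object]) -> Box:
    """Grow an underestimate of the intersection region along the grid of
    coordinate values that occur in the sample."""
    lower, upper = box
    anchor = list(lower)                      # any point contained in A
    new_lower, new_upper = [], []
    for i in range(len(anchor)):
        values = {p[i] for p in sample_points} | {lower[i], upper[i]}
        positive_values = []
        for v in values:
            query = list(anchor)
            query[i] = v                      # vary only coordinate i
            if member(tuple(query)) == 1:     # one membership query per value
                positive_values.append(v)
        new_lower.append(min(positive_values))
        new_upper.append(max(positive_values))
    return tuple(new_lower), tuple(new_upper)
```

The sketch issues at most |S| + 2 membership queries per dimension, so its running time is polynomial in the size of S and d.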

The Full Algorithm. Putting all the pieces together we obtain our algorithm. We first 
approximate the intersection region, and then we refine this estimate using Expand. Next 
we apply a version of LearnBOQ of suitable dimension to the points in the various sub-regions 
generated by the intersection region. Finally, we combine the hypotheses obtained from the calls 
to LearnBOQ along with our estimate of the intersection region to obtain our final hypothesis. 
The complete algorithm is shown in Figure 5.7. 

Theorem 66 LearnBoxesAgreement is a PAC with membership query algorithm for learning
the agreement of s axis-parallel boxes in $E^d$. Let $p_+$ be the probability of receiving a positive
example from D. The sample complexity is
$m = O\left(\frac{3^{2d}}{\epsilon^2}\log\frac{3^d}{\delta} + \frac{3^{2d}}{\epsilon^2}\,ds\log s\,\log\frac{3^d}{\epsilon} + \frac{1}{p_+}\log\frac{1}{\delta}\right)$,
and the time complexity is $O(sm)$.

Proof: Note that by drawing a sample of size $\frac{1}{p_+}\ln\frac{2}{\delta}$, with probability at least $1 - \delta/2$
we will obtain a positive example. It follows directly from Blumer et al. [23] that our sample
suffices to ensure that the hypothesis output by IBox(T) has error at most $\epsilon/3^d$ with
probability at least $1 - \frac{\delta}{2 \cdot 3^d}$. We now show that for each of the $3^d - 1$ remaining sub-regions,
the sample is sufficiently large so that with probability at least $1 - \frac{\delta}{2 \cdot 3^d}$ there are enough
points so that the hypothesis output by LearnBOQ for that region has error at most $\epsilon/3^d$.
It immediately follows from Theorem 64 that it is sufficient to provide LearnBOQ with a
LearnBoxesAgreement

1   Draw a sample S of size m := max{...}.
2   If there are no positive examples halt and report failure.
3   Let T be the set of examples obtained by relabeling all "?" examples of S as negative.
4   Set A := Expand(IBox(T), S).
5   Let R be the set of sub-regions generated by A (excluding A itself).
6   For each sub-region $r \in R$
7       Choose any point $p_r$ in the boundary shared by A and the sub-region r such
        that $p_r$ is an extreme point in every dimension of the boundary. Let $f_r$ be the
        coordinate transformation that translates $p_r$ to the origin of $E^d$.
8       /* Identify dimensions for which we already know the extent of any sub-box lying
           in r */
        Let $I_r$ be the dimensions for which r is not bounded between a pair of parallel
        hyperplane bounds for A.
9       /* Project out those dimensions for any point of S that lies in r; relabel "?"
           examples */
        Let $S_r := \{p' : p \in S \text{ lies in } r \text{ and } p' = \pi_{I_r}(f_r(p))\}$.
        If $p' \in S_r$ is labeled with "?" then relabel it as positive.
10      Set $B_r$ to be the set of boxes returned by LearnBOQ($S_r$).
11  Given any unlabeled example x, predict
        1   if x lies in A
        ?   if $\exists r \in R, b \in B_r$ such that x lies in sub-region r and $\pi_{I_r}(f_r(x))$ lies in b
        0   otherwise

Figure 5.7: The algorithm LearnBoxesAgreement for learning the agreement of a set of
axis-parallel boxes with samplable intersection region.




sample of $m' = \max\left(\frac{4 \cdot 3^d}{\epsilon}\log\frac{8 \cdot 3^d}{\delta}, \frac{16ds\log(3s) \cdot 3^d}{\epsilon}\log\frac{13 \cdot 3^d}{\epsilon}\right)$ points from the given region. By
applying Chernoff bounds from Fact 6 it is straightforward to show that a sample of size
$m = \max\left\{\frac{2m'3^d}{\epsilon}, \frac{8 \cdot 3^d}{\epsilon}\ln\frac{4 \cdot 3^d}{\delta}\right\}$ is sufficiently large so that for each sub-region of weight at
least $\epsilon/3^d$ there are at least $m'$ points from the sample with probability at least $1 - \frac{\delta}{4 \cdot 3^d}$. (Note
that sub-regions with weight less than $\epsilon/3^d$ can contribute at most $\epsilon/3^d$ to the total error.) It
follows from Lemma 65 that the total error of our final hypothesis is the sum of the errors of
the hypotheses we generate for each of the $3^d$ regions. Thus the probability that the error of
the final hypothesis is more than $3^d \cdot \epsilon/3^d = \epsilon$ is at most $3^d \cdot \frac{\delta}{2 \cdot 3^d} = \delta/2$.

We now compute the time complexity. It is easily argued that the first five steps take O(m)
time. Observe that the loop in step 6 goes over $3^d - 1$ regions. While each of the s boxes can
be sub-divided to have a piece in each sub-region, each point in the sample falls into at most
one of the regions. Thus the total time for step 6 is $\sum_{r \in R} O(s\,m_r)$ where $m_r$ is the number of
sample points that are in region r. Finally, since $\sum_{r \in R} m_r \le m$, we get that the total time for
step 6 is at most O(sm). □

5.5 A Negative Result

By the results of chapter 2, the class of propositional Horn sentences is known to be PAC with 
membership query learnable. We provide evidence that this result cannot be strengthened to 
allow learning Horn sentences from a consistently ignorant teacher, by showing that such an 
algorithm could be used to learn the class of DNF formulas. 

Unlike in the previous section where we used the knowledge of the intersection region of a 
set of s boxes to aid in an algorithm to learn the agreement of any set of s boxes having a non- 
empty intersection, here we show that the knowledge of the intersection region of a set of Horn 
sentences appears to be insufficient in providing the leverage needed to learn the disjunction 
(and thus agreement) of the Horn sentences. 

Let DHF represent the class of disjunctions of Horn Sentences. We begin with the observa- 
tion that the class of DNF formulas is a subset of the class of DHF formulas. 

Claim 67 For any DNF formula there exists a logically equivalent DHF formula. 
Proof: Observe that every unnegated literal $v$ is equivalent to the Horn clause $(T \to v)$
and every negated literal $\bar{v}$ is equivalent to the Horn clause $(v \to F)$. For example,
$a\bar{b}c = (T \to a)(b \to F)(T \to c)$. Thus we can represent each term by a Horn sentence and take
the disjunction of these Horn sentences to build a DHF formula that is logically equivalent
to the given DNF formula. Finally, observe that the DHF formula created by this
transformation has size polynomial in the DNF formula from which it was created. □
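
The transformation in this proof is simple enough to check by brute force. The following sketch is illustrative only: the data representation (a term as a list of (variable, polarity) pairs, a Horn clause as a pair of antecedent tuple and consequent, with None standing for F) and all function names are assumptions made for the example.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, bool]                        # (variable, True if unnegated)
Clause = Tuple[Tuple[str, ...], Optional[str]]    # (antecedent, consequent or None for F)


def term_to_horn_sentence(term: List[Literal]) -> List[Clause]:
    """Unnegated v becomes (T -> v); negated v becomes (v -> F)."""
    return [((), var) if positive else ((var,), None) for var, positive in term]


def dnf_to_dhf(dnf: List[List[Literal]]) -> List[List[Clause]]:
    """One Horn sentence per term; the DHF formula is their disjunction."""
    return [term_to_horn_sentence(term) for term in dnf]


def eval_dnf(dnf: List[List[Literal]], a: Dict[str, bool]) -> bool:
    return any(all(a[v] == pos for v, pos in term) for term in dnf)


def eval_clause(clause: Clause, a: Dict[str, bool]) -> bool:
    antecedent, consequent = clause
    if not all(a[v] for v in antecedent):
        return True                               # antecedent false: clause holds
    return consequent is not None and a[consequent]


def eval_dhf(dhf: List[List[Clause]], a: Dict[str, bool]) -> bool:
    return any(all(eval_clause(c, a) for c in sentence) for sentence in dhf)
```

Each literal yields exactly one clause, so the DHF formula has the same number of clauses as the DNF formula has literals.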

Using the above observation, it is easily shown that the problem of learning an agreement of 
Horn sentences (without any restrictions) is as hard as learning DNF. However, as demonstrated 
by our algorithm to learn the agreement of boxes, if the intersection of the Horn sentences in 
the agreement were non-empty then it may be possible to use the intersection information to 
successfully learn the disjunction. We now prove the stronger negative result that learning 
the agreement of Horn sentences even when the intersection region is samplable is as hard as 
learning the class of DNF formulas. 

Theorem 68 PAC with membership query learning the agreement of Horn sentences for which 
the intersection region is samplable is as hard as PAC with membership query learning the class 
of DNF formulas. 

Proof: We prove this through a sequence of prediction preserving reductions [86]. Let
DHF-1pos be the class of DHF formulas with exactly one positive example p that satisfies every
disjunct. Let agree-Horn-1pos be the agreement of Horn sentences that have exactly one
example in their intersection. Finally, let agree-Horn be the agreement of Horn sentences with
a samplable intersection region.

Applying Claim 67 it immediately follows that the learnability of DHF implies the
learnability of DNF. We now give a reduction showing that the learnability of DHF-1pos (even
when the learner has a priori knowledge of the single positive example) implies the learnability
of DHF. Let $f$ be the target of the algorithm for DHF and $f_p$ be the target of the algorithm
to be constructed for DHF-1pos. Choose the zero vector as the positive example p known to
the learner, and let $\{v_1, \ldots, v_n\}$ be the set of variables over which $f$ is defined. We construct $f_p$
from $f$ by adding an extra literal $v$ to the antecedent of every Horn clause in $f$ as well as adding
the Horn sentence $(v \to F)(v_1 \to F) \cdots (v_n \to F)$. Note that the only example satisfying every
disjunct of $f_p$ is the zero vector p.
The DHF algorithm $A$ simulates the queries for the DHF-1pos algorithm $A_+$ as follows:
When $A_+$ requests an example, $A$ obtains a random example $x$ (that assigns values only to
$v_1, \ldots, v_n$), generates the example $x'$ that is like $x$ with the additional variable $v$ set to 1, and
gives $x'$ to $A_+$ labeled as $x$ was. If $A_+$ makes a membership query on some vector $x'$, if $v = 0$
then $A$ returns "1" and if $v = 1$ then $A$ responds with the result of a membership query on the
example $x$ that is just $x'$ with the setting for the variable $v$ eliminated. Once $A_+$ terminates
with hypothesis $h_+$, $A$ is able to predict the label of any example $x$ by setting $v$ to 1 and
evaluating $h_+$ on that example. Note that setting $v$ to 1 causes the added Horn sentence in $f_p$
to evaluate to 0 and the antecedents of all the remaining Horn clauses to not be affected by $v$.

We now show that the learnability of agree-Horn-1pos implies that DHF-1pos is learnable.
It is at this point that we switch from learning a standard boolean concept to learning an
agreement. Note that the learning problem for the class DHF-1pos assumes that the learner
knows the single positive example that satisfies every disjunct of the target. Any algorithm
for agree-Horn-1pos can be used to learn DHF-1pos by simply providing the sole positive
example of agree-Horn-1pos to DHF-1pos as p and changing all "?" examples to positive
examples.

Finally, we show that the learnability of agree-Horn implies that agree-Horn-1pos is
learnable. Recall that a PAC with membership query learning algorithm must learn under any
distribution D. When the agree-Horn algorithm requests a random example, the simulation
algorithm flips a fair coin. With probability 1/2, the simulation provides the agree-Horn
algorithm with the single positive point in agree-Horn-1pos (and thus the positive region is
samplable). Otherwise, a random example drawn from the oracle is given to the agree-Horn
algorithm. Clearly agree-Horn is a generalization of agree-Horn-1pos and thus at least as
hard.

Thus it follows from this sequence of reductions that PAC with membership query learning
the agreement of Horn sentences with a samplable positive region is as hard as PAC with
membership query learning the class of DNF formulas. □

Finally, we further strengthen this result by using the hardness result of Angluin and
Kharitonov [10] showing that, under the assumption that one-way functions exist,
membership queries do not help in learning DNF formulas (with an unbounded number of terms).

Corollary 69 PAC with membership query learning the agreement of Horn sentences for which
the intersection region is samplable is as hard as PAC learning the class of DNF formulas
assuming that one-way functions exist.

5.6 Relating Agreements and Version Spaces

We take a brief diversion and compare Mitchell's definition of version spaces [80] to agree- 
ments [78]. 

Given a concept class C of boolean functions, a finite sample E of examples, a set of positive
examples $P \subseteq E$, and a set of negative$^6$ examples $N \subseteq E$, recall that the version space [80]
is the set of concepts in C consistent with P and N. As there may be many such consistent
concepts, a version space is often represented using two sets G and S defined as follows. Let

$$C_V = \{f \in C : \forall n \in N\ f(n) = 0 \wedge \forall p \in P\ f(p) = 1\}.$$

Next, taking $g > f$ for functions $g$ and $f$ to denote that $g$ is strictly more general than $f$,$^7$
define

$$G = \{f \in C_V : \nexists\, g \in C_V\ \ g > f\},$$

and define

$$S = \{f \in C_V : \nexists\, g \in C_V\ \ f > g\}.$$

Intuitively, G is the set of concepts in $C_V$ consistent with a set of labeled examples such that the
concepts in G are maximally general and S is the set of concepts in $C_V$ consistent with a set
of labeled examples such that the concepts in S are maximally specific.

Mitchell observes that associated with any G and S sets is a ternary function $\mathrm{VS}_{[S,G]}$ (our
notation) that expresses how a version space predicts the label of an example x:

$^6$ It need not be the case that $P \cup N = E$.

$^7$ That is, for every $x \in E$, $g(x) \ge f(x)$ and there is some $x_0 \in E$ such that $g(x_0) > f(x_0)$. In other words the
set of examples labeled positive by $g$ is a proper superset of the examples labeled positive by $f$.
$$\mathrm{VS}_{[S,G]}(x) = \begin{cases} 1 & \text{if } s(x) = 1 \text{ for each } s \in S, \\ 0 & \text{if } g(x) = 0 \text{ for each } g \in G, \\ ? & \text{otherwise.} \end{cases}$$



We now relate agreements to $\mathrm{VS}_{[S,G]}$ and discuss why the former is more appropriate for
our purposes.

Let S and G delimit a version space. To see that $\mathrm{VS}_{[S,G]}$ defines an agreement of functions,
observe that for all examples x, $\mathrm{VS}_{[S,G]}(x) = \mathrm{Agree}_{S \cup G}(x)$. Specifically, if $\mathrm{VS}_{[S,G]}$ labels an
example x with "?," then for some s in S and some g in G, s labels x negative, and g labels
x positive. Thus, since s and g disagree on the label of x, $\mathrm{Agree}_{S \cup G}$ labels x with "?" also. If
$\mathrm{VS}_{[S,G]}$ labels x positive, then, by definition, every s in S labels x positive. Since the elements
of G are more general than the elements of S, every g in G labels x positive also. So $\mathrm{Agree}_{S \cup G}$
labels x positive also. (Dually when $\mathrm{VS}_{[S,G]}$ labels x negative.)
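
The equality of the two ternary functions can be checked on a toy version space. In the sketch below the function names `agree` and `vs_predict` are hypothetical, concepts are plain Python callables returning 0 or 1, and the chosen S and G (the monomial $x_1 x_2$ and the more general monomial $x_1$) are an arbitrary illustrative example.

```python
from typing import Callable, Sequence, Tuple, Union

Example = Tuple[int, ...]
Concept = Callable[[Example], int]
Label = Union[int, str]


def agree(functions: Sequence[Concept], x: Example) -> Label:
    """Agreement: the common label if all functions agree on x, otherwise '?'."""
    labels = {f(x) for f in functions}
    return labels.pop() if len(labels) == 1 else "?"


def vs_predict(S: Sequence[Concept], G: Sequence[Concept], x: Example) -> Label:
    """Version-space prediction from the specific set S and the general set G."""
    if all(s(x) == 1 for s in S):
        return 1
    if all(g(x) == 0 for g in G):
        return 0
    return "?"
```

Note that `vs_predict` relies on every element of G being at least as general as every element of S; without that ordering the two functions need not coincide.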

Furthermore, every agreement can be represented by a ternary function, $\mathrm{VS}_{[S,G]}$, for suitably
defined S and G. Let F be a subset of C. Recall that E is a finite subset of examples, and that
$\mathrm{Agree}_F$ and $f$ are functions with ranges $\{0, 1, ?\}$ and $\{0, 1\}$, respectively. Define

$$C_A = \{f \in C : \forall x \in E\ f(x) = \mathrm{Agree}_F(x)\}.$$

Now define

$$G = \{f \in C_A : \nexists\, g \in C_A\ \ g > f\},$$

and

$$S = \{f \in C_A : \nexists\, g \in C_A\ \ f > g\}.$$

We show that, for all examples x, $\mathrm{Agree}_F(x) = \mathrm{VS}_{[S,G]}(x)$.

Note that all the functions in the set F are trivially contained in the set $C_A$. If $\mathrm{Agree}_F$
labels x with "?," there exist functions $f_1$ and $f_2$ in F such that $f_1$ labels x negative and $f_2$
labels x positive. Since $f_1$ and $f_2$ are themselves in F, they are also in $C_A$. Further, since $f_1$
and $f_2$ are in $C_A$, there exist s and g in S and G respectively, such that s labels x negative
and g labels x positive, where s is at least as specific as $f_1$ and g is at least as general as $f_2$.
(So, regardless of where $f_1$ and $f_2$ "lie relative to" the elements of S and G, there is an element
of S that labels x negative and an element of G that labels x positive.) Consequently, $\mathrm{VS}_{[S,G]}$
labels x with "?."

If $\mathrm{Agree}_F$ labels x positive then every $f$ in $C_A$ labels x positive. Since S is a subset of $C_A$,
every s in S labels x positive. Thus $\mathrm{VS}_{[S,G]}$ labels x positive also. (Dually when $\mathrm{Agree}_F$ labels
x negative.)

We now discuss why the agreement representation is more appropriate for our purposes. 

Membership Queries The concept classes we investigate (for example, the agreement of
a constant number of monotone DNF, k-term DNF, or decision trees) are not known to be
learnable without membership queries. It can be shown that the candidate elimination algorithm
(CEA) (defined in [80]) requires exponential time to learn these concept classes. However,
since the CEA does not have access to membership queries, the comparison seems ill-founded.
Moreover, it is not clear how to augment the CEA with membership queries. Thus, the CEA
is not appropriate for the classes we consider.

Order of Labeled Examples Even if we could augment the CEA with membership queries, 
since the CEA is sensitive to the order labeled examples are provided, it is again not suitable 
for our purposes. 

There has been previous evidence of the exponential growth associated with maintaining
G and S sets [53, 57, 55]. Hirsh [62] has proposed avoiding this growth by constructing only
the smaller of the sets G and S and maintaining the training examples as a representation of
the unconstructed set. We present a situation in which the sizes of both the G and S sets grow
exponentially due to the order in which labeled examples are provided to the CEA. Thus, even
representations which maintain only one of G or S do not avoid the problem of exponential
growth.

Let C be the class of monotone monomials defined over the variable set $\{x_i\}_{i=1}^{2n}$, such that
C contains exactly the constant functions true and false together with monomials of the form
$\bigwedge_{i \in S} x_i$ where S is a subset of $\{1, \ldots, 2n\}$ of cardinality exactly n. For example, if n = 1,
$C = \{\text{false}, x_1, x_2, \text{true}\}$.
Now consider how the CEA behaves for general n. Observe that initially G = {true} and
S = {false}. Suppose the following three labeled examples were supplied to the CEA:

• the positive example $(1 \cdots 1\,1 \cdots 1)$ with all variables satisfied,

• the negative example $(0 \cdots 0\,0 \cdots 0)$ with all variables falsified, and

• the negative example $(1 \cdots 1\,0 \cdots 0)$ with the first n variables satisfied and the remaining
variables falsified.

After the first two examples, both G and S consist of the $\binom{2n}{n}$ monotone monomials from C.
However, altering the order of the examples so that the third example is provided first causes
the G and S sets to consist of one element each. In contrast, observe that the class C is trivially
learnable with membership queries.
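
The count after the first two examples can be confirmed by enumeration for small n. The sketch below is illustrative: it only enumerates the size-n monomials of C that are consistent with a list of labeled examples (it does not simulate the CEA), it uses 0-based variable indices, and the name `consistent_monomials` is hypothetical. Since distinct monomials over exactly n variables are mutually incomparable, every consistent one is both maximally general and maximally specific.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[int, ...], int]   # (assignment to the 2n variables, label)


def consistent_monomials(n: int, examples: Sequence[Example]) -> List[Tuple[int, ...]]:
    """Index sets of size n whose monotone monomial agrees with every example."""
    result = []
    for subset in combinations(range(2 * n), n):
        if all(all(x[i] for i in subset) == bool(label) for x, label in examples):
            result.append(subset)
    return result
```

For n = 3 the first two examples leave all 20 monomials consistent, and the third example eliminates only the monomial over the first three variables.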

Conciseness We have already demonstrated that the agreement representation is always as 
small as and sometimes exponentially smaller than the G and S set representation. Thus a 
positive result in the agreement representation is "stronger" in the sense that we are allowed 
possibly exponentially less time. 

Why is the G and S set representation not always concise? Our purpose is only to capture 
a ternary function. With this in mind, one reason for this relative lack of conciseness is that 
the agreement representation need not use only those functions that can be partitioned into 
two sets, one "general" and the other "specific;" indeed the agreement representation is free to 
represent the ternary function as a set of mutually incomparable functions. 

This relative lack of conciseness also occurs when the agreement representation does consist
of functions that can be partitioned into the general and specific sets: sometimes there exist
exponentially smaller subsets $G'$ and $S'$ of G and S respectively, that are equivalent in terms of
prediction (that is, for all examples x, $\mathrm{VS}_{[S',G']}(x) = \mathrm{VS}_{[S,G]}(x)$). G and S may contain many
more consistent functions than are necessary for accurate prediction [78].

5.7 Discussion 

We have presented a new formal learning model for learning from a consistently ignorant teacher.
Along with giving several useful characterizations of this model, we have given some general
conditions indicating when one can successfully learn under this model. In addition we present
a polynomial time algorithm for learning the agreement of $s$ boxes in $E^d$ for $d = O(\log s)$. We
have also shown that learning the agreement of Horn sentences is as hard as learning DNF
from random examples.

There are many interesting open problems raised by this work. First of all it would be 
interesting to explore other concept classes for which learning finite unions (or intersections) is 
hard to see if the agreement of concepts from the class is learnable. Also, although we have 
shown that learning the agreement of an arbitrary number of Horn sentences
is as hard as learning the class of DNF formulas, this hardness result may not hold when the
number of Horn sentences in the agreement is bounded with respect to the number of variables. 
For example, it is an open question as to whether or not there is an algorithm for learning the 
agreement of a constant number of Horn sentences. 
Chapter 6

Restricted First-Order Horn 
Sentences 

We now make a transition from propositional concepts to first-order concepts. We return to 
learning from entailment, and begin with a restricted class of first-order Horn sentences. 

A natural framework for research in inductive logic programming is the investigation of 
the learnability/predictability of various classes of definite clause theories, particularly in the 
PAC [98] and exact learning models [4]. Relatively little work has been done within this 
framework, though interest is rising sharply [15, 34, 35, 41, 70, 81, 82]. This chapter describes 
new results on the learnability of several restricted classes of simple (two-clause) definite clause 
theories that may contain recursive clauses; theories with recursive clauses appear to be the 
most difficult to learn. The positive results are proven for learning by equivalence queries, 
which implies PAC learnability [4]. In obtaining the results, we introduce techniques that may 
be useful in studying the learnability of other classes of definite clause theories with recursion. 

The results are presented with the following organization. Section 2 describes the learning
model. Section 3 shows that the class $\mathcal{H}_{2,1}$, whose concepts are built from unary predicates,
constants, variables, and unary functions, is learnable. Section 4 shows that the class $\mathcal{H}_{2,*}$, an
extension of $\mathcal{H}_{2,1}$ that allows predicates of arbitrary arity, is not learnable under a reasonable
complexity-theoretic assumption. 1 Nevertheless, Section 4 also shows that each subclass $\mathcal{H}_{2,k}$

1 The classes $\mathcal{H}_{2,*}$ and $\mathcal{H}_{2,k}$ (discussed next) have one other restriction, that variables are stationary. This
restriction is defined later.
of $\mathcal{H}_{2,*}$, in which predicates are restricted to arity $k$, is learnable in terms of a slightly more
general class, and is therefore PAC predictable. The prediction algorithm is a generalization
of the learning algorithm in Section 3. The results of Section 4 leave open the questions of
whether (1) $\mathcal{H}_{2,*}$ is PAC predictable and (2) $\mathcal{H}_{2,k}$ is PAC learnable. Section 5 shows that the
techniques used in Section 4 can also be used to prove that the class TP 2 U CFTFB un i q [15],
which allows higher-arity functions, cannot be learned. It conjectures that a related, and in
some ways broader, class can be learned, using techniques from earlier sections. Section 6
returns to unary predicates, but allows an arbitrary number of clauses; the resulting class is
called $\mathcal{H}_{*,1}$. Section 6 shows that this class is equivalent to the class of regular languages,
though the question of representation size remains open. While the results of this section do
not subsume any other results of the chapter, Section 6 does show that it may be possible to
obtain learnability results for some classes of definite clause theories from results about formal
languages and automata. Section 7 relates our results to other work on the learnability of
definite clause theories in the PAC or exact learning models. The primary distinction of this
work from the most closely related work [41] is that the classes studied in this paper are not
determinate because functions can be nested to arbitrary depth. 2

6.1 The Model 

Algorithms that learn concepts expressed in propositional logic traditionally have used as ex- 
amples truth assignments, or models. Such an example is positive if and only if it satisfies 
the concept. But concepts in first-order logic may (and almost always do) have infinite mod- 
els. Therefore, algorithms that learn definite clause 3 theories typically take logical formulas, 
usually ground atomic formulas, as examples instead. Such an example is positive if and 
only if it is a logical consequence of the concept. The algorithms in this chapter use ground 
atomic formulas (atoms) as examples in this manner. A concept is used to classify ground 
atoms according to the atoms' truth value in the concept's least Herbrand model, 4 which is 

2 More precisely, the classes that result from flattening (rewriting to contain no functions) the concepts are
not determinate.

3 A definite clause is a disjunction of atoms, exactly one of which is unnegated.

4 The least Herbrand model of a set of definite clauses (every such set has one) is sometimes referred to as the
model or the unique minimal model of the set. The set of definite clauses entails a logical sentence if and only
if the sentence is true in this model. It is often useful to think of this model as a set, namely, the set of ground
atoms that it makes true.
to say, according to whether the atoms logically follow from the concept. For example, the
concept $[\forall x\,(p(f(g(x))))] \wedge [\forall x\,(p(f(x)) \rightarrow p(f(h(x))))]$, which is in $\mathcal{H}_{2,1}$, classifies $p(f(g(c)))$
and $p(f(h(g(c))))$ as true or positive while it classifies $p(f(c))$ as false or negative. If $A$
and $B$ are two concepts that have the same least Herbrand model, we say they are equivalent,
and we write $A \equiv B$.

In a learning problem, a concept $C$, called the target, is chosen from some class of concepts
$\mathcal{C}$ and is hidden from the learner. Each concept classifies each possible example element $x$
from a set $X$, called the instance space, as either positive or negative. The learner infers some
concept $C'$ based on information about how the target $C$ classifies the elements of the instance
space $X$. For each of our learning problems, the concept class $\mathcal{C}$ is a class of definite clause
theories, and we require that any learning algorithm, $A$, must for any $C \in \mathcal{C}$ produce a concept
$C' \in \mathcal{C}$ such that $C' \equiv C$, that is, that $C$ and $C'$ have the same least Herbrand model. (For
predictability we remove the requirement that $C'$ belong to $\mathcal{C}$.) The instance space $X$ is the
Herbrand universe of $C$, and the learning algorithm $A$ is able to obtain information about the
way $C$ classifies elements of $X$ only by asking equivalence queries, in which $A$ conjectures some
$C'$ and is told whether $C' \equiv C$. If $C' \not\equiv C$, $A$ is provided a counterexample $x$ that $C'$ and $C$
classify differently.

We close this section by observing that the union of several classes can be learned by 
interleaving the learning algorithms for each class. 

Fact 70 Let $p(n)$ be a polynomial in $n$, and let $\{\mathcal{C}_i : 1 \le i \le p(n)\}$ be concept classes with
learning algorithms $\{A_i : 1 \le i \le p(n)\}$ having time complexities $\{T_{A_i} : 1 \le i \le p(n)\}$
respectively. Then the concept class $\bigcup_{i=1}^{p(n)} \mathcal{C}_i$ can be learned in time $\max_{1 \le i \le p(n)} \{p(n)\,T_{A_i}\}$.
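
The interleaving behind Fact 70 can be illustrated with a small round-robin scheduler. This is only a sketch under stated assumptions: modelling each learner as a Python iterator that yields `None` after every unsuccessful step and finally yields its hypothesis is a choice made here for illustration, and the names `dovetail` and `toy_learner` are hypothetical, not taken from the thesis.

```python
from typing import Iterator, Optional, Sequence


def dovetail(learners: Sequence[Iterator[Optional[str]]]) -> Optional[str]:
    """Run the learners round-robin, one step each, until one succeeds.

    Assumption of this sketch: a learner yields None after every
    unsuccessful step and yields its hypothesis once it has learned the target.
    """
    active = list(learners)
    while active:
        still_running = []
        for learner in active:
            try:
                result = next(learner)
            except StopIteration:
                continue  # this learner gave up; drop it from the rotation
            if result is not None:
                return result
            still_running.append(learner)
        active = still_running
    return None


def toy_learner(name: str, steps_needed: Optional[int]) -> Iterator[Optional[str]]:
    """Hypothetical learner: succeeds after steps_needed steps, never if None."""
    step = 0
    while steps_needed is None or step < steps_needed:
        step += 1
        yield None
    yield f"hypothesis from {name}"


if __name__ == "__main__":
    learners = [toy_learner("A1", None), toy_learner("A2", 3), toy_learner("A3", 10)]
    print(dovetail(learners))
```

Because every learner advances by one step per round, the total work is roughly the number of learners times the running time of the learner that succeeds, which is the shape of the bound stated in Fact 70.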

6.2 The Class $\mathcal{H}_{2,1}$

The concept class $\mathcal{H}_{2,1}$ is the class of concepts that can be expressed as a conjunction of at
most two simple clauses, where a simple clause is a positive literal (an atom) composed of unary
predicates and unary or 0-ary functions or an implication between two such positive literals.
Allowing arbitrarily many literals in the antecedent of the implication, rather than only one,
provides no additional expressivity. This is because we are considering the least Herbrand model
to be the meaning of a concept in $\mathcal{H}_{2,1}$. The least Herbrand model of a pair of implicative clauses
is empty; on the other hand if one of the clauses is an atom $A$, the implicative clause will add
atoms to the least Herbrand model only if every literal in the antecedent of the implicative
clause unifies with $A$.

As an example, the following is a concept in $\mathcal{H}_{2,1}$ that we have seen already.

$$[\forall x\,(p(f(g(x))))] \wedge [\forall x\,(p(f(x)) \rightarrow p(f(h(x))))]$$

Since our conjuncts are always universally quantified, we henceforth leave the quantification
implicit. Thus the above concept is written

$$[p(f(g(x)))] \wedge [p(f(x)) \rightarrow p(f(h(x)))]$$

We can divide $\mathcal{H}_{2,1}$ into two classes: trivial concepts, which are equivalent ($\equiv$) to conjunctions
of at most two atoms, and non-trivial, or recursive, concepts. The trivial concepts of $\mathcal{H}_{2,1}$
can be learned easily. 5 We next describe an algorithm that learns the non-trivial, or recursive,
concepts in $\mathcal{H}_{2,1}$. It follows that $\mathcal{H}_{2,1}$ is learnable, since we can interleave this algorithm with
the one that learns the trivial concepts of $\mathcal{H}_{2,1}$.

It can be shown that the recursive concepts in $\mathcal{H}_{2,1}$ have the form

$$[p(t_1)] \wedge [p(t_2(x)) \rightarrow p(t_3(x))] \qquad (6.1)$$

where $t_1$ is a term, and $t_2(x)$ and $t_3(x)$ are terms ending in the same variable, $x$. 6 The fact that
the functions and predicates are unary leads to a very concise description of a recursive concept
in $\mathcal{H}_{2,1}$. Specifically, we can drop all parentheses in and around terms. Further, since we are
discussing recursive concepts, all predicate symbols are the same and can likewise be dropped.
Thus any concept having the form of concept (6.1) may be written $[\alpha e] \wedge [\beta x \rightarrow \gamma x]$, or

$$\left\{\begin{array}{l} \alpha e \\ \beta x \rightarrow \gamma x \end{array}\right. \qquad (6.2)$$

5 The basic idea is that the learning algorithm, by using equivalence queries, is able to obtain one example
for each of the (at most two) atoms in the concept. Only a few atoms are more general than (that is, have as
instances) each example, and the algorithm conjectures all combinations of these atoms, one for each of the (at
most two) examples.

6 The proof of this is omitted for brevity, as are some other proofs.
where $\alpha$, $\beta$, and $\gamma$ are strings of function symbols, $x$ is a variable, and $e$ is either a constant or
a variable. Using this notation, determining whether, for example, $\alpha e$ unifies with $\beta x$ requires
only determining whether either $\alpha$ is a prefix of $\beta$ or $\beta$ is a prefix of $\alpha$. For any strings $\alpha$ and
$\beta$, if $\alpha$ is a prefix of $\beta$ then we write $\alpha \le \beta$.

The atoms in the least Herbrand model that are not instances of the base atom in the
concept 7 are generated by applying the recursive clause. A concept is equivalent to a conjunction
of two atoms if its recursive clause can be applied at most once. For the recursive clause to
apply at all, $\alpha$ must unify with $\beta$, and for it to apply more than once, $\beta$ must unify with $\gamma$.
Hence a concept is non-trivial only if either $\alpha \le \beta$ or $\beta \le \alpha$ and either $\beta \le \gamma$ or $\gamma \le \beta$. In light
of Fact 70, to show that the non-trivial concepts can be learned in polynomial time, we need
show only that the class of non-trivial concepts can be partitioned into a polynomial number of
concept classes, each of which can be learned in polynomial time. Therefore, we carve the class
of non-trivial concepts into five different subclasses defined by the prefix relationships among
$\alpha$, $\beta$, and $\gamma$. The approach for each subclass is similar: generalize the first positive example in
such a way that the oracle is forced to provide a positive example containing whatever pieces
of $\alpha$, $\beta$, or $\gamma$ are missing from the first example. The five possible sets of prefix relationships
that can yield recursive concepts, based on our earlier discussion, are (1) $\beta \le \alpha$ and $\beta \le \gamma$, (2)
$\alpha \le \beta$ and $\beta \le \gamma$, (3) $\alpha \le \gamma$ and $\gamma \le \beta$, (4) $\alpha \le \beta$ and $\gamma \le \alpha$, (5) $\beta \le \alpha$ and $\gamma \le \beta$. We do not need to
divide case (1) into two cases because the relationship between $\alpha$ and $\gamma$ is irrelevant here.
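
The string notation makes it easy to experiment with how the recursive clause generates atoms. The following is a minimal sketch, not the thesis's algorithm: the `Atom` representation, the helper names (`apply_rule`, `generated_atoms`, `covers`), and the convention of naming the rule variable `"x"` are all assumptions introduced here. Unification reduces to the two prefix tests described above; the second test additionally needs the atom to end in a variable, since a constant cannot be bound to a non-empty string of function symbols.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Atom:
    funcs: str    # unary function symbols, outermost first (one character each)
    end: str      # final symbol: a constant or a variable name
    is_var: bool  # True when `end` is a variable


def apply_rule(atom: Atom, beta: str, gamma: str) -> Optional[Atom]:
    """Apply the recursive clause (beta x -> gamma x) to a generated atom."""
    if atom.funcs.startswith(beta):
        # beta is a prefix of the atom: x is bound to whatever follows beta.
        return Atom(gamma + atom.funcs[len(beta):], atom.end, atom.is_var)
    if atom.is_var and beta.startswith(atom.funcs):
        # The atom is a proper prefix of beta and ends in a variable:
        # that variable is bound to the rest of beta followed by x.
        return Atom(gamma, "x", True)
    return None  # the atom does not unify with the antecedent


def generated_atoms(base: Atom, beta: str, gamma: str, limit: int) -> List[Atom]:
    """Base atom followed by the results of repeated rule application (at most `limit` atoms)."""
    atoms = [base]
    while len(atoms) < limit:
        nxt = apply_rule(atoms[-1], beta, gamma)
        if nxt is None or nxt in atoms:
            break
        atoms.append(nxt)
    return atoms


def covers(general: Atom, funcs: str, const: str) -> bool:
    """Is the ground atom (funcs followed by const) an instance of `general`?"""
    if general.is_var:
        return funcs.startswith(general.funcs)
    return funcs == general.funcs and const == general.end


if __name__ == "__main__":
    # [p(f(g(x)))] and [p(f(x)) -> p(f(h(x)))], written as "fg x" and "f x -> fh x".
    atoms = generated_atoms(Atom("fg", "x", True), beta="f", gamma="fh", limit=4)
    print([a.funcs for a in atoms])
    print(any(covers(a, "fg", "c") for a in atoms))  # p(f(g(c)))
    print(any(covers(a, "f", "c") for a in atoms))   # p(f(c))
```

The `limit` parameter is needed because a recursive concept may generate infinitely many atoms; a ground atom is classified as positive exactly when some generated atom covers it.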

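6.2.1 $\beta \le \alpha$ and $\beta \le \gamma$

This class consists of concepts with the form

$$\left\{\begin{array}{l} \phi\psi e \\ \phi x \rightarrow \phi\omega x \end{array}\right. \qquad (6.3)$$

Concepts of this form insert arbitrarily many copies of $\omega$ between $\phi$ and $\psi$. Thus a concept of
this class has for its least Herbrand model

$$\phi\omega^{*}\psi e$$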

7 Since we are speaking now of recursive concepts only, we refer to the two parts of the concept as the base
atom and the recursive clause.

Lemma 71 This class can be learned in polynomial time.



Proof: To learn this class an algorithm needs to obtain an example that contains $\phi$, $\omega$,
and $\psi$. Every example contains $\phi$ and $\psi$, and every example except an instance of the base
atom contains $\omega$. The algorithm first conjectures the null concept and receives a positive
counterexample $\alpha e$, which is an instance of some atom generated by the concept (either the
base atom or the result of applying the recursive clause some number of times). There are
$|\alpha| + 1$ atoms of which $\alpha e$ is an instance: $\alpha e$ itself and each $\alpha_i x$ where $\alpha_i$ is a prefix of $\alpha$.
The algorithm guesses one of these as the generated atom. (Here and throughout, when we
use the term guesses, we actually mean that the algorithm dovetails all the possible choices.)
It then proposes the generated atom in an equivalence query and receives another positive
counterexample, which is an instance of some other generated atom. The algorithm guesses
this generated atom from this example in a similar manner. Then the algorithm guesses which
of the two generated atoms is generated later; this atom must contain $\phi$, $\psi$ (and $e$ if $e$ is a
constant). Finally, the algorithm guesses whether $e$ is a constant or a variable, 8 and, from the
later atom guesses the values of $\phi$, $\omega$, and $\psi$. It is straightforward to verify that there are at
most $2n^3$ possible combinations of such guesses. □

6.2.2 $\alpha \le \beta$ and $\beta \le \gamma$

This class consists of concepts with the form

$$\left\{\begin{array}{l} \phi e \\ \phi\psi x \rightarrow \phi\psi\omega x \end{array}\right. \qquad (6.4)$$

Concepts of this form add copies of $\omega$ after a prefix of $\phi\psi$. This class has as its least Herbrand
model

$$\phi e + \phi\psi\omega^{*}$$

provided $\phi e$ unifies with $\phi\psi x$; otherwise the least Herbrand model degenerates to $\phi e$.

Lemma 72 This class is exactly learnable.

8 If $e$ is a constant, it is the last symbol in each example.
Proof Sketch: Let $\rho e$ be the (positive) counterexample to the empty hypothesis. Assuming
that $\rho = \phi\psi\omega^{k}\xi$ for some $k > 0$, $O(|\rho|^4)$ guesses suffice to identify $\phi$, $\psi$, and $\omega$.

If $\rho$ does not contain a copy of $\omega$, then $\rho = \phi\xi$ and $O(|\rho|)$ guesses suffice to identify $\phi$. Once
$\phi$ has been identified, conjecturing

$$\phi e \qquad (6.5)$$

will elicit a (positive) counterexample $\rho' e$ that does contain copies of both $\psi$ and $\omega$; identifying
each of these requires only $O(|\rho'|^3)$ guesses. □

6.2.3 $\alpha \le \gamma$ and $\gamma \le \beta$

This class consists of concepts with the form

$$\left\{\begin{array}{l} \phi e \\ \phi\psi\omega x \rightarrow \phi\psi x \end{array}\right. \qquad (6.6)$$

The recursive rule adds nothing to the least Herbrand model, so concepts of this form have as
the least Herbrand model

$$\phi e$$

Thus concepts of this form are equivalent to

$$\phi e \qquad (6.7)$$

Lemma 73 This class is exactly learnable.

Proof Sketch: Let $\rho e$ be the (positive) counterexample to the empty hypothesis. Because
$\rho = \phi\xi$, only $O(|\rho|)$ guesses are needed to identify $\phi$. □
6.2.4 $\alpha \le \beta$ and $\gamma \le \alpha$

This class consists of concepts with the form

$$\left\{\begin{array}{l} \phi\psi e \\ \phi\psi\omega x \rightarrow \phi x \end{array}\right. \qquad (6.8)$$

Again the recursive rule adds nothing to the least Herbrand model, so concepts of this form
have as the least Herbrand model

$$\phi' e$$

where $\phi' = \phi\psi$. Thus concepts of this form are equivalent to

$$\phi' e \qquad (6.9)$$

Lemma 74 This class is exactly learnable.

Proof Sketch: Let $\rho e$ be the (positive) counterexample to the empty hypothesis. Because
$\rho = \phi'\xi$, only $O(|\rho|)$ guesses are needed to identify $\phi'$. □

6.2.5 $\beta \le \alpha$ and $\gamma \le \beta$

This class consists of concepts having the form

$$\phi\psi x \rightarrow \phi x \qquad (6.10)$$

Concepts of this form generate smaller and smaller atoms by deleting copies of $\psi$ at the front
of $\omega$; if $\psi$ is not a prefix of $\omega$, the concept can delete only one copy of $\psi$. Any concept of this
form has the least Herbrand model described by

$$\phi\psi^{k}\zeta e \quad \text{for } 1 \le k \le n + 1, \text{ where } \omega = \psi^{n}\zeta$$

Lemma 75 This class can be learned in polynomial time.
Proof: To learn this class an algorithm needs to obtain an example that contains $\phi$, $\psi$,
and $\zeta$. It then must determine $n$. We give an algorithm that makes at most two equivalence
queries to obtain $\phi$, $\psi$, $\zeta$, and $\xi$. It then guesses larger and larger values for $n$ until it guesses
the correct value. This value of $n$ is linearly related to the length of the base atom, so overall
the algorithm takes polynomial time.

1. Conjecture the false concept to obtain counterexample $\rho e$.

2. Dovetail the following algorithms:

   • Assuming $\rho = \phi\psi^{j}\zeta\xi e$ for some $j \ge 1$:

     (a) Select $\phi$, $\psi$, $\zeta$, $\xi$ from $\rho$
     (b) Guess the value of $n$
     (c) Halt with output [...]

   • Assuming $\rho = \phi\zeta\xi e$:

     (a) Select $\phi$, $\zeta$, $\xi$ from $\rho$
     (b) Conjecture [...] to obtain counterexample $\rho' e$. Note that
         $\rho'$ necessarily contains $\psi$ as a substring.
     (c) Select $\psi$ from $\rho'$
     (d) Guess $n$
     (e) Halt with output [...], $\phi\psi x \rightarrow \phi x$

When the algorithm selects substrings from a counterexample, it is in reality dovetailing all
possible choices; nevertheless, we observe that there are only $O(|\rho|^5)$ (respectively $O(|\rho'|)$)
choices to try. Similarly, when the algorithm guesses the value of $n$, it is actually making
successively larger and larger guesses for $n$ and testing whether it is correct with an equivalence
query. It will obtain the correct value for $n$ in time polynomial in the size of the target concept.
At that point it outputs the necessarily correct concept and halts. □

Some observations about the above example are in order. In these, and afterward, we use
$g_0$ to denote the base atom and, inductively, $g_{k+1}$ to denote the result of applying the recursive
clause to $g_k$. 9 It is worth noting that the algorithm above uses only two of the counterexamples
it receives, though it typically makes more than two equivalence queries. This is the case with
the algorithms for the other subclasses as well. It is also worth noting that when the algorithm
above guesses the value of $n$, it is guessing the number of times the recursive clause is applied
to generate the earliest generated atom $g_i$ of which either example is an instance.
From the preceding arguments, we have Theorem 76, below. 

Theorem 76 The concept class $\mathcal{H}_{2,1}$ is learnable.

6.3 Increasing Predicate Arity 

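It is often useful to have predicates of higher arity, but otherwise maintain the form of the
concepts in $\mathcal{H}_{2,1}$. For example

$$\left\{\begin{array}{l} plus(x, 0, x) \\ plus(x, y, z) \rightarrow plus(x, s(y), s(z)) \end{array}\right.$$

$$\left\{\begin{array}{l} greater(s(x), 0) \\ greater(x, y) \rightarrow greater(s(x), s(y)) \end{array}\right.$$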



In this section we remove the requirement that predicates be unary. Specifically, let $\mathcal{H}_{2,*}$ be the
result of allowing predicates of arbitrary arity but requiring functions to remain unary, with the
additional restriction, which we define next, that variables be stationary. Notice that because
functions are unary, each argument has at most one variable (it may have a constant instead of
a variable), and that variable must be the last symbol of the argument. A concept meets the
stationary variables restriction if for any variable $x$, if $x$ appears in argument $i$ of the consequent
of the recursive clause then $x$ also appears in argument $i$ of the antecedent. This class does
include the above arithmetic concepts built with the successor function and the constant 0,
but does not include the concept $[p(a,b,c)] \wedge [p(x,y,z) \rightarrow p(z,x,y)]$ because variables "shift
positions" in the recursive clause.
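
The stationary variables restriction is easy to check mechanically. The sketch below is illustrative only: representing each argument as a pair (string of unary function symbols, final symbol) and passing the set of variable names explicitly are assumptions made here, and `is_stationary` is a hypothetical helper name.

```python
from typing import List, Set, Tuple

# An argument is (string of unary function symbols, final constant or variable).
Arg = Tuple[str, str]


def is_stationary(antecedent: List[Arg], consequent: List[Arg], variables: Set[str]) -> bool:
    """Check that every variable in argument i of the consequent also ends argument i of the antecedent."""
    for (_, ante_end), (_, cons_end) in zip(antecedent, consequent):
        if cons_end in variables and cons_end != ante_end:
            return False
    return True


if __name__ == "__main__":
    variables = {"x", "y", "z"}
    # plus(x, y, z) -> plus(x, s(y), s(z))
    print(is_stationary([("", "x"), ("", "y"), ("", "z")],
                        [("", "x"), ("s", "y"), ("s", "z")], variables))
    # p(x, y, z) -> p(z, x, y): the variables shift positions
    print(is_stationary([("", "x"), ("", "y"), ("", "z")],
                        [("", "z"), ("", "x"), ("", "y")], variables))
```

Since functions are unary, each argument contains at most one variable and it is the last symbol, so comparing the final symbols position by position is sufficient.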

We begin with the following observation. As for the class $\mathcal{H}_{2,1}$, a concept in $\mathcal{H}_{2,*}$ is trivial
if it is equivalent to a conjunction of at most two atoms. It is straightforward to verify that,
because variables are stationary, the non-trivial concepts in $\mathcal{H}_{2,*}$ have the form

$$[p(t_1(e_1), \ldots, t_k(e_k))] \wedge [p(t'_1(e'_1), \ldots, t'_k(e'_k)) \rightarrow p(t''_1(e''_1), \ldots, t''_k(e''_k))] \qquad (6.11)$$

where $t_i(e_i)$, $t'_i(e'_i)$, and $t''_i(e''_i)$ denote terms ending in $e_i$, $e'_i$, and $e''_i$, respectively, and for some
$1 \le j \le k$ we have $e'_j = e''_j = x$ for some variable $x$.

9 If the concept is a conjunction of two atoms, the choice of which atom is $g_0$ and which is $g_1$ is arbitrary.

Unfortunately, we have the following result for the class $\mathcal{H}_{2,*}$.

Theorem 77 $\mathcal{H}_{2,*}$ is not learnable, assuming $RP \ne NP$. 10

Before proving the theorem, we prove the following useful lemma. 

Lemma 78 Any concept in $\mathcal{H}_{2,*}$ generates at most two function-free atoms, that is, at most
two atoms whose arguments are constants or variables.

Proof: The result clearly holds for the trivial concepts in $\mathcal{H}_{2,*}$. Any other concept has
form (6.11) above. Suppose a concept of this form generates at least three function-free atoms.
One of these may be the base atom itself, while at least two of these atoms must be generated
using the recursive clause. For the recursive clause to be used in generating such atoms, its
consequent $p(t''_1(e''_1), \ldots, t''_k(e''_k))$ must be function-free, that is, must be of the form $p(e''_1, \ldots, e''_k)$,
where each $e''_i$ is a constant or a variable. Let $p(d_1, \ldots, d_k)$, where each $d_i$ is a constant or a
variable, be the first function-free atom generated using the recursive clause. We show that
any atom generated after $p(d_1, \ldots, d_k)$ is also $p(d_1, \ldots, d_k)$; hence, no third function-free atom is
generated by the concept.

For the recursive clause to generate an atom after $p(d_1, \ldots, d_k)$, $p(d_1, \ldots, d_k)$ must unify with
the antecedent of the recursive clause, $p(t'_1(e'_1), \ldots, t'_k(e'_k))$. Therefore, we know that for each
$1 \le i \le k$, at least one of the following is true: $d_i = t'_i(e'_i) = c$ for some constant $c$, or $t'_i(e'_i)$
is a variable $x$, or $d_i$ is a variable. In the first case, $e''_i = c$ (and in fact $e'_i = c$), so every atom
generated by the concept has $c = d_i$ at argument $i$. In the second case, if $e''_i = x$, then $e_i =
d_i$, so every atom generated by the concept has $d_i$ at argument $i$; otherwise, $e''_i = d_i$, so every
atom generated by applying the recursive clause has $d_i$ at argument $i$. In the third case, $e''_i$
must be a variable $y$. We have just argued the case where $t'_i(e'_i)$ and $e''_i$ are the same variable,
$x$. If $t'_i(e'_i) \ne e''_i$, then $d_i = e''_i = y$, so every atom generated by applying the recursive clause
has $d_i$ at argument $i$. Thus every atom generated after $p(d_1, \ldots, d_k)$ agrees with $p(d_1, \ldots, d_k)$ at
every argument. □

10 That is, $\mathcal{H}_{2,*}$ is not PAC learnable, and therefore is not learnable in polynomial time by equivalence queries.
RP is the class of problems that can be solved in random polynomial time.
Proof of Theorem 77: We show that the consistency problem [85] for this concept class is
NP-hard. Our reduction is from the consistency problem for two-term-DNF. The consistency
problem is: given a set of labeled examples, determine whether there exists a concept in the
class that agrees with all examples on their labels. In particular, the consistency problem for
$\mathcal{H}_{2,*}$ requires determining whether there exists a concept in $\mathcal{H}_{2,*}$ that entails all of the positive
examples but none of the negative ones. If the consistency problem for a given class is NP-hard,
then the class is not PAC learnable (assuming $RP \ne NP$) [85], which in turn implies that
it is not polynomial-time learnable by equivalence queries.

We first claim, based on Lemma 78, that if all examples are function-free then $\mathcal{H}_{2,*}$ contains
a concept consistent with the examples if and only if there exists a function-free conjunction of
at most two atoms 11 that is consistent with the examples. To see the if case, notice that every
function-free conjunction of at most two atoms is in $\mathcal{H}_{2,*}$. To see the only if case, notice that
only function-free atoms can have other function-free atoms as instances; therefore, a
function-free atom is classified positive by a given concept in $\mathcal{H}_{2,*}$ if and only if it is an instance of one
of the (at most two, by Lemma 78) function-free atoms generated by the concept.

Therefore, it is enough to show that the problem of finding a function-free conjunction of 
at most two atoms consistent with a labeled sample of function-free atoms is NP-hard. (Doing 
so will in fact show that the consistency problem is NP-complete, since we can efficiently check 
whether a given conjunction of at most two atoms is consistent with a labeled sample.) To do 
so, we reduce from the consistency problem for 2-term-DNF [85]. 

Define a mapping $f : \{0,1\}^n \rightarrow \{0, 1, z_1, z_2, \ldots, z_n\}^{2n}$ as follows: $y = f(x)$ where $y_i = 1$ and
$y_{n+i} = z_i$ when $x_i = 1$, and $y_i = z_i$ and $y_{n+i} = 0$ when $x_i = 0$.

Now consider any labeled sample $S$ and construct $S' = \{p(y) \mid x \in S, y = f(x)\}$ where $S'$
preserves the labels of $S$. We show that there is a two atom conjunction consistent with $S'$ if
and only if there is a 2-term-DNF consistent with $S$.

If: Define a mapping $g$ from (propositional) terms, $t$, to atoms as follows. Start with the atom
$p(z_1, z_2, \ldots, z_{2n})$. If the literal $x_i$ appears in $t$, change $z_i$ to 1. If the literal $\bar{x}_i$ appears in $t$,
change $z_{n+i}$ to 0. Notice that for any (propositional) term $t$ and any example $x$, every instance
of $p(f(x))$ is an instance of $g(t)$ if and only if $x$ satisfies $t$. Thus, $g(t_1) \wedge g(t_2)$ is consistent with
$S'$ if $t_1 \vee t_2$ is consistent with $S$.

11 That is, a single function-free atom or a conjunction of two function-free atoms.
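
The key property of the two mappings, that $p(f(x))$ is an instance of $g(t)$ exactly when $x$ satisfies $t$, can be checked exhaustively for small $n$. The sketch below assumes the reading of $f$ and $g$ given above with $2n$ arguments per atom; the list-of-strings encoding of atoms, the dictionary encoding of terms, and the helper names are illustrative assumptions, not part of the thesis.

```python
from itertools import product
from typing import Dict, List, Tuple


def f(x: Tuple[int, ...]) -> List[str]:
    """Map a truth assignment to the 2n arguments of a function-free atom."""
    n = len(x)
    y = [""] * (2 * n)
    for i, bit in enumerate(x):  # i is 0-based here; variables are named z1..zn
        if bit == 1:
            y[i], y[n + i] = "1", f"z{i + 1}"
        else:
            y[i], y[n + i] = f"z{i + 1}", "0"
    return y


def g(term: Dict[int, bool], n: int) -> List[str]:
    """Map a term to an atom; term maps a 0-based index to True (x_i) or False (negated x_i)."""
    atom = [f"z{j + 1}" for j in range(2 * n)]
    for i, positive in term.items():
        if positive:
            atom[i] = "1"
        else:
            atom[n + i] = "0"
    return atom


def is_instance(specific: List[str], general: List[str]) -> bool:
    """`general` has no repeated variables, so only its constants need to match."""
    return all(gen.startswith("z") or gen == spec for spec, gen in zip(specific, general))


def satisfies(x: Tuple[int, ...], term: Dict[int, bool]) -> bool:
    return all((x[i] == 1) == positive for i, positive in term.items())


if __name__ == "__main__":
    n = 3
    term = {0: True, 2: False}  # x_1 and not x_3
    for x in product([0, 1], repeat=n):
        assert is_instance(f(x), g(term, n)) == satisfies(x, term)
    print("property holds for all assignments with n =", n)
```

A variable in `f(x)` at a position where `g(t)` holds a constant blocks the instance relation, which is how an unsatisfied literal is detected.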

Only If: Now consider a conjunction of two atoms consistent with $S'$. We must show how to
construct a 2-term-DNF consistent with $S$.

Consider any two atom conjunction, $a_1 \wedge a_2$, consistent with $S'$. For each atom, $a$, construct
a (propositional) term as follows. If argument $i$ ($i \le n$) of $a$ is 1, place $x_i$ in the term. If
argument $n + i$ of $a$ is 0, place $\bar{x}_i$ in the term. If arguments $i$ and $j$ ($i < j \le n$) are the same,
place both $x_i$ and $x_j$ in the term. If arguments $n + i$ and $n + j$ are the same, place both $\bar{x}_i$ and
$\bar{x}_j$ in the term. We claim that the disjunction of the two terms so formed from $a_1$ and $a_2$ is a
2-term-DNF consistent with $S$.

Observe that we can form two (not necessarily disjoint) subsets, $S'_1$ and $S'_2$ of $S'$ such that
each atom in $S'_1$ is entailed by $a_1$ and each atom in $S'_2$ is entailed by $a_2$. Notice that neither
$S'_1$ nor $S'_2$ contains any of the negative examples of $S'$, but between them they contain all the
positive examples of $S'$. Without loss of generality consider $S'_1$ and $a_1$. If the $i$th argument of
$a_1$ is 1, then the $i$th argument of every atom in $S'_1$ is 1. But the atoms in $S'_1$ have 1 as their
$i$th argument only if their (propositional) counterparts had $x_i$ set to 1. Therefore, placing $x_i$ in
the (propositional) term arising from $a_1$ does not mis-label any of the (propositional) examples
that gave rise to atoms in $S'_1$. Similarly if the $(n + i)$th argument of $a_1$ is 0.

Finally, notice that no atom in $S'$ contains a co-reference (i.e., no variable appears twice in
an atom). Thus if $a_1$ contains a co-reference, the co-reference must correspond to arguments
that were equal constants in the atoms in $S'_1$. However, any constants appearing among the
first $n$ arguments of an atom in $S'_1$ must be 1, while any constants appearing among the last $n$
arguments of an atom in $S'_1$ must be 0. Therefore, we can make $a_1$ more specific by changing
a co-reference among the first $n$ arguments to a 1, and by changing a co-reference among the
last $n$ arguments to a 0; the resulting atom entails no more than $a_1$ but still entails every atom
in $S'_1$.

A similar construction can be done for $a_2$ and $S'_2$. Disjoining the (propositional) terms so
constructed produces a 2-term-DNF consistent with $S$. □

Our conjecture is that the class $\mathcal{H}_{2,*}$ is not even predictable (that is, learnable in terms
of any other class), though this is an open question. Nevertheless, we now show that if we
fix the predicate arity to any integer $k$, then the resulting concept class $\mathcal{H}_{2,k}$ is learnable in
terms of a slightly more general class, called $\mathcal{H}_{2,k'}$, and is therefore predictable (the question of
learnability of $\mathcal{H}_{2,k}$ in terms of $\mathcal{H}_{2,k}$ itself remains open). Concepts in $\mathcal{H}_{2,k'}$ may be any union
of a concept in $\mathcal{H}_{2,k}$ and two additional atoms built from variables, constants, unary functions,
and predicates with arity at most $k$; an example is classified as positive by such a concept if
and only if it is classified as positive by the concept in $\mathcal{H}_{2,k}$ or is an instance of one of the
additional atoms. For the class $\mathcal{H}_{2,k'}$, we consider conjunctions of at most three atoms to be
trivial concepts. The class of such concepts can be learned by a simple algorithm that is similar
to the algorithm that learns the trivial concepts of $\mathcal{H}_{2,1}$. We now focus, instead, on the learning
algorithm for the recursive concepts of $\mathcal{H}_{2,k'}$. This learning algorithm is based on the learning
algorithm for $\mathcal{H}_{2,1}$, and central to it are the following definition and lemma. In the lemma,
and afterward, we use $G_0$ to denote the base atom and, inductively, $G_{i+1}$ to denote the result
of applying the recursive clause to $G_i$. The following lemma allows us to learn $\mathcal{H}_{2,k}$ using the
algorithm for learning $\mathcal{H}_{2,1}$.

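Definition 79 Let

$$\left\{\begin{array}{l} p(\alpha_1 e_1, \ldots, \alpha_k e_k) \\ p(\beta_1 e'_1, \ldots, \beta_k e'_k) \rightarrow p(\gamma_1 e''_1, \ldots, \gamma_k e''_k) \end{array}\right.$$

be a concept in $\mathcal{H}_{2,k}$. Then we say the subconcept at argument $i$, for $1 \le i \le k$, of this concept
is

$$\left\{\begin{array}{l} \alpha_i e_i \\ \beta_i e'_i \rightarrow \gamma_i e''_i \end{array}\right.$$

Lemma 80 Let

$$\left\{\begin{array}{l} p(\alpha_1 e_1, \ldots, \alpha_k e_k) \\ p(\beta_1 e'_1, \ldots, \beta_k e'_k) \rightarrow p(\gamma_1 e''_1, \ldots, \gamma_k e''_k) \end{array}\right.$$

be a concept in $\mathcal{H}_{2,k}$. For any $1 \le j \le k$, if $e'_j$ is a variable $x$, then for any $n \ge 2$, if
$G_n = p(t_1, \ldots, t_k)$ unifies 12 with $p(\beta_1 e'_1, \ldots, \beta_k e'_k)$, the binding generated for $x$ by this unification
is the same as the binding generated for $x$ by unifying $t_j$ with $\beta_j x$. 13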



12 Note that $G_3$ exists if and only if $G_2$ unifies with $p(\beta_1 e'_1, \ldots, \beta_k e'_k)$. If $G_3$ does not exist, the result holds
trivially.

13 The proof of this lemma is rather long. We present it here because the lemma is both non-obvious and
central to one main result of the chapter, Theorem 81.
Proof: We prove the contrapositive. Assume that for some $1 \le j \le k$, $e'_j$ is a variable $x$, and
for some $n \ge 2$, $G_n = p(t_1, \ldots, t_k)$ unifies with $p(\beta_1 e'_1, \ldots, \beta_k e'_k)$, but the binding generated for $x$
by this unification is not the same as the binding generated for $x$ by unifying $t_j$ with $\beta_j x$. Then
for some $1 \le i \le k$, $e'_i$ is also $x$, and unifying $t_i$ with $\beta_i x$ yields a different binding for $x$ than
does unifying $t_j$ with $\beta_j x$. We obtain a contradiction by showing that these unifications cannot
yield different bindings.

Because the concept is stationary, and $e'_i$ and $e'_j$ are each $x$, each of $e''_i$ and $e''_j$ must be
either $x$ or a constant. Furthermore, for the recursive clause to be applicable to generate $G_3$,
if $e''_i$ and $e''_j$ are both constants then they must be some same constant, $c$. This leaves us three
cases to consider.

First, both $e''_i$ and $e''_j$ are some constant $c$. Then the recursive rules for argument $i$ and
argument $j$, respectively, are $\beta_i x \rightarrow \gamma_i c$ and $\beta_j x \rightarrow \gamma_j c$. Hence $G_1$ has $\gamma_i c$ and $\gamma_j c$ at arguments
$i$ and $j$, respectively. Then since $G_2$ is generated, $\beta_i t = \gamma_i c$ and $\beta_j t = \gamma_j c$ for some term $t$. Then
argument $i$ and argument $j$ always, individually bind $x$ to $t$ after $G_2$, giving a contradiction.

Second, $e''_i$ is a constant, $c$, and $e''_j$ is a variable, $x$. (By symmetry, this case also covers
the case where $e''_i$ is $x$ and $e''_j$ is $c$.) Then the recursive rules for argument $i$ and argument
$j$, respectively, are $\beta_i x \rightarrow \gamma_i c$ and $\beta_j x \rightarrow \gamma_j x$. Hence $G_1$ has $\gamma_i c$ at argument $i$. Since $G_2$
is generated, we have $\beta_i t = \gamma_i c$ for some ground term $t$. Then on unification of $G_1$ with
$p(\beta_1 e'_1, \ldots, \beta_k e'_k)$, $x$ is bound to $t$. Furthermore, it must be the case that either $\beta_j \le \gamma_j$ or
$\gamma_j \le \beta_j$ for unification to succeed. Consider both cases.

Case 1: Pj < 7;, so Pj(j> = for some <j> such that |<£| > 0. Then G2 has ijt = pj(j>t at 
argument j, and 7,-c = at argument i. Since G3 is generated, (Pit,Pj<j>t) must unify with 
(PiX y Pjx), which can occur only if \<f>\ = 0. If |<£| = 0, then after G2 the recursive rules for 
argument i and argument j will always, independently bind x to t, giving a contradiction. 

Case 2: 7> < so 7;<£ = /?,■ for some <j> such that |<£| > 0. Then G2 has 7^ at argument 
j and 7,-c = pit at argument i. Since G3 is generated, (/M»7>*) must unify with (piX.Pjx) = 
(PiX^fj(j>x) y which can occur only if \(j>\ = 0. If \(j>\ = 0 then, again, after G 2 the recursive rules 
for argument i and argument j will always, independently bind x to t, giving a contradiction. 

Finally, both $e''_i$ and $e''_j$ are the variable $x$. Then the recursive rules for argument $i$ and
argument $j$, respectively, are $\beta_i x \to \gamma_i x$ and $\beta_j x \to \gamma_j x$. Then $G_2$ has $\gamma_i t$ and $\gamma_j t$, for some
term $t$, at argument $i$ and argument $j$, respectively. Since $G_3$ is generated, we have either
$\beta_i \le \gamma_i$ or $\gamma_i \le \beta_i$. Consider both cases.

Case 1: $\beta_i \le \gamma_i$, so $\beta_i \phi = \gamma_i$ for some $\phi$ such that $|\phi| \ge 0$. For $G_2$ to be generated, $(\gamma_i t, \gamma_j t)$
must unify with $(\beta_i x, \beta_j x)$. This is possible only if $\beta_j \phi = \gamma_j$, in which case, during unification
with $G_m$, $m \ge 1$, the recursive rules for argument $i$ and argument $j$ will always, independently
bind $x$ to $\phi^m t$, giving a contradiction.

Case 2: $\gamma_i \le \beta_i$, so $\beta_i = \gamma_i \phi$ for some $\phi$ such that $|\phi| \ge 0$. For $G_2$ to be generated,
$(\gamma_i t, \gamma_j t)$ must unify with $(\beta_i x, \beta_j x) = (\gamma_i \phi x, \beta_j x)$. For this to occur, either (A) $t = \psi y$, for
variable $y$, and $\psi \le \phi$, or (B) $t = \psi e$, for some constant or variable $e$, and $\phi \le \psi$. If (A)
is true, then since $t = \psi y$ where $\psi \le \phi$, we have $\psi\omega = \phi$ for some $\omega$ such that $|\omega| \ge 0$. Since
$(\gamma_i t, \gamma_j t) = (\gamma_i \psi y, \gamma_j \psi y)$ must unify with $(\gamma_i \phi x, \beta_j x) = (\gamma_i \psi\omega x, \beta_j x)$, it must be the case that
$\beta_j = \gamma_j \psi\omega$. Then $x$ is left unbound during unification, so $G_3$ has $\gamma_i x$ and $\gamma_j x$ at arguments $i$
and $j$, respectively. Then for each $G_m$, $m \ge 3$, $x$ remains unbound, and $G_m$ has $\gamma_i x$ and $\gamma_j x$
at arguments $i$ and $j$ respectively. Because $x$ remains unbound throughout, bindings given by
the recursive rules for arguments $i$ and $j$ cannot disagree, which is a contradiction. If (B) is true, then
since $\phi \le \psi$, we have $\psi = \phi\omega$ for some $\omega$ such that $|\omega| \ge 0$. For $(\gamma_i t, \gamma_j t) = (\gamma_i \psi e, \gamma_j \psi e) =
(\gamma_i \phi\omega e, \gamma_j \phi\omega e)$ to unify with $(\gamma_i \phi x, \beta_j x)$, it must be the case that $\beta_j = \gamma_j \phi$. Then $x$ is bound to
$\omega e$. The recursive rules for arguments $i$ and $j$ therefore have the forms, respectively, $\gamma_i \phi x \to \gamma_i x$
and $\gamma_j \phi x \to \gamma_j x$. Hence $G_m$, $m \ge 3$, is generated if and only if $\phi\omega = \phi^n \eta$ for some $\eta$, where
$n \ge (m - 2)$. In unifying $G_m$, $m \ge 2$, with $p(\beta_1 e'_1,\ldots,\beta_k e'_k)$, the recursive rules for arguments $i$
and $j$ will always, independently bind $x$ to $\phi^{n-m}\eta$, which is a contradiction. □

Theorem 81 For any constant $k$, the class $\mathcal{H}_{2,k}$ is predictable from equivalence queries alone.

Proof: (Sketch) By Lemma 80, any concept $C$ in $\mathcal{H}_{2,k}$ can be rewritten as the union of
three concepts (only polynomially larger than $C$), two of which are the atoms $G_0$ and $G_1$. The
third concept is some $C'$ in $\mathcal{H}_{2,k}$ whose base atom $G'_0$ is $G_2$ and whose recursive clause (if any)
generates $G'_1 = G_3, \ldots, G'_m = G_{m+2}, \ldots$ meeting the following condition: for all $m \ge 0$ and any
variable $x$ in the antecedent of the recursive clause, no two argument positions impose different
bindings on $x$ when generating $G'_{m+1}$ from $G'_m$. In other words, the behavior of $C'$ can be
understood as a simple composition of the behavior of the subconcepts at arguments 1 through
k. This observation motivates an algorithm that learns H2A in terms of Hz.k* A.t the highest 
level of description, the algorithm poses equivalence queries in such a way that it obtains, as
examples, instances of $G_0$, $G_1$, and $G_i$ and $G_j$ for distinct $i$ and $j$. The algorithm determines
$G_0$ and $G_1$ from their examples, and it determines $C'$ from $G_i$ and $G_j$. To determine $C'$ the
algorithm uses the learning algorithm for $\mathcal{H}_{2,1}$ to learn the subconcepts of $C'$.$^{14}$ We now fill in
the details of the algorithm.

The algorithm begins by conjecturing the empty theory, and it necessarily receives a positive
counterexample in response. This counterexample is an instance of some more general atom,
$A_1$, that is either $G_0$, $G_1$, or some $G_i$. The algorithm guesses $A_1$ and guesses whether $A_1$ is $G_0$,
$G_1$, or some $G_i$. It then conjectures $A_1$ in an equivalence query and, if $A_1$ is not the target,
necessarily receives another positive example. (As earlier, by guess we mean that the algorithm
dovetails the possible choices.) This example is also an instance of $G_0$, $G_1$, or some $G_i$, but
it is not an instance of $A_1$. Again, the algorithm guesses that atom (call it $A_2$) and guesses
whether it is $G_0$, $G_1$, or some $G_i$. Following the second example, and following each new
example thereafter, the algorithm has at least two of $G_0$, $G_1$, $G_i$, and $G_j$ (some $i \ne j$).
It conjectures the union of those that it has, with the following exception: if it has both $G_i$
and $G_j$ (any $i \ne j$), it uses a guess of $C'$ in place of $G_i$ and $G_j$. Again, in response to such a
conjecture, either the algorithm is correct or it receives a positive counterexample. It remains
only to show how (1) the atoms $G_0$, $G_1$, $G_i$, and $G_j$ are "efficiently guessed", and (2) $C'$ is
"efficiently guessed" from $G_i$ and $G_j$.

Given any atom, $A$, entailed by $G_0$, we know that $G_0$ is a generalization of $A$. Because no
function has arity greater than 1, there are at most $O((2|A|)^k)$ generalizations of $A$ to consider,
all of which can be tried in parallel. Note that, because $k$ is fixed, the number of possible
generalizations is polynomial in the size of the example. $G_1$, $G_i$, $G_j$ are efficiently guessed in
the same way.

To learn $C'$, the algorithm first determines the high-level structure of $C'$; specifically, it guesses
how many variables are in the base atom and in the antecedent and consequent of the recursive

14 The only subtlety is that the learning algorithm for $\mathcal{H}_{2,1}$ might return a concept that is a conjunction of at
most two atoms; such a concept cannot serve as a subconcept for a member of $\mathcal{H}_{2,k}$. But the only case in which
this can occur is the case for which every atom after $g_0$ is the same. And we provide the learning algorithm
with only examples of $g_2, \ldots$ (examples extracted from $G_2, G_3, \ldots$). Therefore, if the concept returned by
the learning algorithm is a conjunction of at most two atoms, it is in fact a single atom, $\alpha$, in which case we
may use the concept in $\mathcal{H}_{2,1}$ whose base atom, recursive clause antecedent, and recursive clause consequent are
all $\alpha$.
clause of $C'$ and which subconcept uses which (if any) variable. There are at most $O(k^{3k})$
possibilities, where $k$ is fixed. The algorithm is then left with the task of precisely determining
the subconcepts of $C'$. Because it has two examples of distinct (most general) atoms generated
by $C'$, namely $G_i$ and $G_j$, it has two examples of each subconcept. The algorithm uses the learning
algorithm for $\mathcal{H}_{2,1}$ to learn these subconcepts from these examples, with the following slight
modification. While the learning algorithm for $\mathcal{H}_{2,1}$ would ordinarily conjecture a concept
in $\mathcal{H}_{2,1}$, the present learning algorithm must conjecture a concept in $\mathcal{H}_{2,k}$. Therefore, the
algorithm conjectures every concept in $\mathcal{H}_{2,k}$ that results from any combination of conjectures,
by the learning algorithm for $\mathcal{H}_{2,1}$, for the subconcepts; that is, it tries all combinations of
subconcepts. Because $k$ is fixed, the number of such combinations is polynomial in the sizes of
the counterexamples seen thus far. Finally, recall that in some cases (as in concept 6.10, page
4) the learning algorithm for $\mathcal{H}_{2,1}$ must guess the value for $n$, where $n$ is the smallest number of
times the recursive clause is applied to generate one of the examples.$^{15}$ Therefore, the present
learning algorithm may have to guess $n$. This is handled by initially guessing $n = 1$ and guessing
successively higher values of $n$ until the correct $n$ is reached. This approach succeeds provided
that the target truly contains a type 6.10 subconcept. In the case that the target does not
contain a type 6.10 concept, the potentially non-terminating, errant search for a non-existent
$n$ halts because we are interleaving the steps from all pending guesses of all forms (including
the correct non-type 6.10 form) of the subconcept. □
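
The interleaving of pending guesses described at the end of the proof can be illustrated with a small round-robin scheduler. This is only a sketch under stated assumptions: each pending guess is modeled as a Python generator that yields None while it is still working and yields a result once it succeeds. The names dovetail, search_type_610, and search_other_form are hypothetical and do not come from the thesis.

```python
from itertools import count


def dovetail(searches):
    """Advance every pending search one step at a time (round-robin).

    Each search is a generator yielding None while unfinished and a
    non-None result on success. Returns the first result found, or None
    if every search terminates without success.
    """
    pending = list(searches)
    while pending:
        still_running = []
        for search in pending:
            try:
                outcome = next(search)
            except StopIteration:
                continue  # this guess gave up; drop it
            if outcome is not None:
                return outcome
            still_running.append(search)
        pending = still_running
    return None


def search_type_610(is_correct):
    """Hypothetical search over n = 1, 2, ...; may never terminate."""
    for n in count(1):
        yield ("type 6.10", n) if is_correct(n) else None


def search_other_form(steps_needed):
    """Hypothetical search for a non-type-6.10 form that succeeds eventually."""
    for _ in range(steps_needed):
        yield None
    yield ("other form", None)


# The errant search for a non-existent n never succeeds on its own, but the
# interleaved run still halts because the other form is found.
result = dovetail([search_type_610(lambda n: False), search_other_form(3)])
```

Because every pending guess advances by one step per round, a search that would run forever cannot starve a search that succeeds after finitely many steps.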



6.4 Increasing Function Arity 

This section provides some evidence that the techniques introduced in the previous sections are
useful in studying the learnability of other classes of definite clause theories with recursion. In
particular, this section considers the concept class $TP^2 \cup CFTFB_{uniq}$ studied by Arimura,
Ishizaka, and Shinohara [15]. They show that $TP^2 \cup CFTFB_{uniq}$ is learnable in the limit
with polynomial update time, from positive examples only. Until now, the learnability of this
class in the PAC learning and equivalence query models has been unknown. Using techniques
that allowed us to show that the class $\mathcal{H}_{2,*}$ is not learnable, this section shows that the class
$TP^2 \cup CFTFB_{uniq}$ is not learnable (in both models), assuming $P \ne RP$.

15 The $n$ sought is necessarily the same for all subconcepts; thus only one $n$ needs to be found.
Arimura, Ishizaka, and Shinohara define the class CFTFB to contain each Prolog program
$P$ defining the predicate $p$ with at most two clauses $C_0$ and $C_1$

$C_0 = p(s_1,\ldots,s_k)$

$C_1 = p(x_1,\ldots,x_k) \to p(t_1,\ldots,t_k)$

meeting the following additional conditions.

1. Every argument $s_i$ ($1 \le i \le k$) of the head of $C_0$ is either a function symbol of arity 0 or
a variable symbol.

2. All arguments $x_1,\ldots,x_k$ of the body of $C_1$ are mutually distinct variables.

3. For every $1 \le i \le k$, every argument $x_i$ of the body of $C_1$ occurs exactly once in the term
$t_i$ of the head. Moreover, $x_i$ does not occur in any arguments $t_j$ ($i \ne j$) of the head.
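
The three conditions can be checked mechanically. The following is a minimal sketch, assuming a hypothetical term encoding that is not part of the source: a variable is a Python string, and a function application is a tuple whose first entry is the function symbol (so a constant is a 1-tuple). The function names are likewise illustrative only.

```python
def is_variable(term):
    """Variables are encoded as plain strings (an assumption of this sketch)."""
    return isinstance(term, str)


def is_constant(term):
    """A function symbol of arity 0 is encoded as a 1-tuple such as ("a",)."""
    return isinstance(term, tuple) and len(term) == 1


def count_occurrences(var, term):
    """Number of times the variable var occurs in term."""
    if is_variable(term):
        return 1 if term == var else 0
    return sum(count_occurrences(var, arg) for arg in term[1:])


def satisfies_cftfb_conditions(base_args, body_args, head_args):
    """Check conditions 1-3 for C0 = p(base_args), C1 = p(body_args) -> p(head_args)."""
    k = len(base_args)
    if len(body_args) != k or len(head_args) != k:
        return False
    # Condition 1: C0 is function-free.
    cond1 = all(is_variable(s) or is_constant(s) for s in base_args)
    # Condition 2: body arguments are mutually distinct variables.
    cond2 = all(is_variable(x) for x in body_args) and len(set(body_args)) == k
    if not (cond1 and cond2):
        return False
    # Condition 3: x_i occurs exactly once in t_i and nowhere in t_j, j != i.
    for i, x in enumerate(body_args):
        for j, t in enumerate(head_args):
            expected = 1 if i == j else 0
            if count_occurrences(x, t) != expected:
                return False
    return True


# p(x, a) and p(x, y) -> p(f(x), g(y, a)) meets all three conditions.
ok = satisfies_cftfb_conditions(
    ("x", ("a",)), ("x", "y"), (("f", "x"), ("g", "y", ("a",)))
)
```

The check does not test for a unique 2-mmg, so it covers CFTFB only, not the subclass $CFTFB_{uniq}$.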

In addition, the predicate arity $k$ is fixed (as in the class $\mathcal{H}_{2,k}$, for example). Notice that another
way to state condition (1) is that $C_0$ is function-free. The class $CFTFB_{uniq}$ is the subclass of
CFTFB whose concepts have a unique 2-mmg; a 2-mmg is a pair of maximally-specific atoms
whose instances include all the ground atoms in the concept's least Herbrand model (the ground
atoms that logically follow from the concept). The class $TP^2$ is the class of all conjunctions of
atoms (again the bound on predicate arity is assumed, though, again, no bound is placed on
function arity). The class $TP^2 \cup CFTFB_{uniq}$ obviously contains exactly the concepts in $TP^2$
or $CFTFB_{uniq}$.

Theorem 82 The class $TP^2 \cup CFTFB_{uniq}$ is not learnable, assuming $P \ne RP$.

Proof: [Sketch] The proof is a simple variant of the proof of Theorem 77. Instead of function-free
examples built from an $n$-ary predicate $p$, where $n$ is arbitrary, we use examples built from
a unary predicate $p$ (thus the proof works regardless of how small $k$ is restricted to be) and
an $n$-ary function $f$. The proof is modified only slightly to replace claims about function-free
concepts by analogous claims about concepts of the form $p(f(t_1,\ldots,t_n))$, where $t_1,\ldots,t_n$ are
either constants or variables (are function free). □
We now define the class $CFTFB'$ to be the union of CFTFB and conjunctions of at most
two atoms, one of which is function-free. In a way, this class is a more natural extension
of CFTFB, in that it can also be described as the union of a function-free atom $C_0$ and a
clause $C_1$ which may either be an atom or a clause of the form $p(x_1,\ldots,x_m) \to p(t_1,\ldots,t_m)$
meeting the restrictions given above. Notice also that this class is in some ways broader than
$TP^2 \cup CFTFB_{uniq}$, since it includes those concepts in CFTFB that do not have a unique
2-mmg. We conjecture that $CFTFB'$ can be learned using methods similar to those used in
this chapter for $\mathcal{H}_{2,1}$ and $\mathcal{H}_{2,k}$. This is an important area for further work since (1) it would be
the first positive learnability result, to our knowledge, for a non-determinate class of recursive
concepts with functions of arity greater than one, and (2) it would provide additional evidence
for the utility of the techniques presented in this chapter.

6.5 Increasing the Number of Clauses: The Class $\mathcal{H}_{*,1}$

In this section we return to the class $\mathcal{H}_{2,1}$ and extend it along a different dimension, allowing
an arbitrary number of clauses. In other words, we consider concepts that consist of arbitrarily
many definite clauses with antecedents of at most one literal where again all literals in the
clauses are built from unary predicates and from constants, variables, and unary functions. We
call this class $\mathcal{H}_{*,1}$. The following example is a concept in $\mathcal{H}_{*,1}$ that defines the even integers
using the constant 0 and unary functions $s$ for successor and $p$ for predecessor (of course, under
this definition each integer is represented by infinitely many different terms).

even(0)
even(x) → even(s(s(x)))
even(x) → even(s(p(x)))
even(x) → even(p(s(x)))
even(x) → even(p(p(x)))

As in the previous section, because all literals have the same predicate and all functions
are unary, we may disregard the predicate and view the literals as strings. And again, we
may assume that for any recursive clause, both literals end with some same variable $x$. In this
section we introduce prefix grammars and prefix languages, and we indicate how the concepts
in $\mathcal{H}_{*,1}$ may be viewed as prefix grammars. Moreover, every prefix grammar is equivalent to
a concept in $\mathcal{H}_{*,1}$ where, by equivalent, we mean that the set of strings generated by the
grammar is exactly the least Herbrand model of the concept, when the members of this model
are viewed as strings. The major result of this section is that the class of languages generated
by prefix grammars, and therefore the class of least Herbrand models of concepts in $\mathcal{H}_{*,1}$, is the
same as the class of regular languages. While this result alone does not lead to a polynomial-time
learning algorithm, it does lead to a weaker learning algorithm, and it also shows that
the answers to open questions about prefix grammars will provide learnability results (either
positive or negative) for $\mathcal{H}_{*,1}$.

Definition 83 (Prefix Grammar) A prefix grammar is a triple

$G = (\Sigma, S, P)$

where $\Sigma$ is a finite set of symbols, $S$ is a finite set of strings over $\Sigma$, and $P$ is a finite set of
productions. Each production in $P$ has the form $\alpha \to \beta$ where $\alpha$ and $\beta$ are strings over $\Sigma$. The
sentential forms for $G$ are the strings over $\Sigma$ alone.$^{16}$ A production $\alpha \to \beta$ may be applied to
a string $\gamma$ if and only if $\gamma = \alpha\omega$ for some $\omega$; the result of applying the production to $\gamma$ is the
string $\beta\omega$. $L(G)$ contains exactly the strings $\omega$ such that either $\omega \in S$ or $\omega$ is the result of
applying a production in $P$ to a string in $L(G)$.

Prefix grammars are so named because their productions may be applied only to a prefix of
a string, to rewrite that prefix. Note that, in contrast with ordinary grammars, every sentential
form generated by a prefix grammar $G$ is a member of $L(G)$. The prefix languages are exactly
the languages generated by prefix grammars.

The following example shows the set of definite clauses for the predicate even from the 
previous example rewritten as a prefix grammar. Predicate symbols are omitted, literals are 
treated as strings, and variables in the recursive clauses are treated as the empty string. The 
grammar generates strings in precisely the same way that the set of definite clauses generates 
atoms. The symbol e denotes the empty string. 

$S = \{0\}$

$P = \{\epsilon \to ss,\ \epsilon \to sp,\ \epsilon \to ps,\ \epsilon \to pp\}$

16 That is to say, there are no non-terminal symbols in this grammar.
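
To make the definition concrete, the following sketch applies prefix productions and enumerates the strings of $L(G)$ up to a length bound for the even grammar. It assumes single-character symbols and represents the empty string as "". The function names and the length bound are illustrative choices, not part of the source. The enumeration only follows derivations whose intermediate strings stay within the bound, which loses nothing here because every production lengthens the string.

```python
from collections import deque


def apply_production(alpha, beta, s):
    """Rewrite the prefix alpha of s to beta; None if alpha is not a prefix of s."""
    if s.startswith(alpha):
        return beta + s[len(alpha):]
    return None


def enumerate_language(base, productions, max_len):
    """Strings of L(G) derivable without ever exceeding max_len symbols."""
    seen = {s for s in base if len(s) <= max_len}
    queue = deque(seen)
    while queue:
        s = queue.popleft()
        for alpha, beta in productions:
            t = apply_production(alpha, beta, s)
            if t is not None and len(t) <= max_len and t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


# The prefix grammar for even: S = {0}, P = {e -> ss, e -> sp, e -> ps, e -> pp}.
EVEN_BASE = {"0"}
EVEN_PRODUCTIONS = [("", "ss"), ("", "sp"), ("", "ps"), ("", "pp")]

language = enumerate_language(EVEN_BASE, EVEN_PRODUCTIONS, max_len=5)
```

Each generated string, read as nested unary function applications ending in 0, corresponds to an atom generated by the definite clauses for even.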



Theorem 84 The classes of prefix languages and regular languages are the same. Specifically, 
for any prefix grammar there exists a right-linear grammar only polynomially larger that gen- 
erates the same language. And for any DFA, there exists a prefix grammar that generates the 
language accepted by the DFA, and the grammar is at most exponentially larger than the DFA. 

Because the proof is relatively long, we spread it over the following two sections of the paper. 

6.6 The Regular Languages Contain the Prefix Languages

We present an algorithm that transforms a prefix grammar $G_P$ into a right-linear grammar
$G_R$ only polynomially larger than $G_P$. We assume that $G_P$ has no productions of the form
$\epsilon \to \delta$, because we can eliminate all such productions with at most polynomial growth of $G_P$
as follows. We assume $\delta$ is nonempty, since any production $\epsilon \to \epsilon$ can be deleted. For each
production $\epsilon \to \delta$ and every symbol $c$ in $G_P$, add the production $c \to \delta c$ to $G_P$, and if $\epsilon$ is a
string in the base set $S$ of $G_P$, add $\delta$ to $S$. Remove every production $\epsilon \to \delta$ from $G_P$.
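
The elimination step just described can be sketched as follows. The sketch follows the text literally and makes several assumptions that are not in the source: symbols are single characters, the alphabet is taken to be every character that appears in the grammar, a grammar is a pair of a base set and a set of (left, right) string pairs, and the empty string enters the language only through the base set. The function name is hypothetical.

```python
def eliminate_empty_lhs(base, productions):
    """Remove productions with an empty left-hand side, as described in the text.

    For each production ("" -> delta) and each symbol c, add (c -> delta + c);
    if "" is a base string, add delta to the base set.
    """
    new_base = set(base)
    # Productions "" -> "" can simply be deleted.
    productions = {(a, b) for a, b in productions if (a, b) != ("", "")}

    # Assumed alphabet: every symbol occurring anywhere in the grammar.
    alphabet = {ch for s in base for ch in s}
    for a, b in productions:
        alphabet.update(a)
        alphabet.update(b)

    deltas = [b for a, b in productions if a == ""]
    new_productions = {(a, b) for a, b in productions if a != ""}
    for delta in deltas:
        for c in alphabet:
            new_productions.add((c, delta + c))
        if "" in base:
            new_base.add(delta)
    return new_base, new_productions


# Base set {"", "a"} with the single production "" -> "b".
new_base, new_productions = eliminate_empty_lhs({"", "a"}, {("", "b")})
```

The growth is at most one new production per pair of an empty-left-hand-side production and a symbol, which matches the polynomial bound claimed in the text.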

6.6.1 Prefix Grammar to Right-Linear Grammar Transformation Algorithm 
Input: A prefix grammar $G_P$ having base set $\{\alpha_1,\ldots,\alpha_n\}$ and productions $\{\beta_1 \to \gamma_1, \ldots,
\beta_m \to \gamma_m\}$.

Output: A right-linear grammar $G_R$ for $L(G_P)$.

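We define a grammar $G_R$ with start symbol $S$, nonterminals $V_{\beta_1},\ldots,V_{\beta_m}$, and all symbols
of $G_P$ as terminals. Let $G_R$ contain the following productions.

$S \to \alpha_1, \ldots, S \to \alpha_n$
$S \to \gamma_1 V_{\beta_1}, \ldots, S \to \gamma_m V_{\beta_m}$

In addition, add to $G_R$ productions of the following form until no more productions can be
added.

$V_{\beta_i} \to \omega V_\theta$ where

$S \to \phi_1 V_{\beta_{i_1}},\ V_{\beta_{i_1}} \to \phi_2 V_{\beta_{i_2}},\ \ldots,\ V_{\beta_{i_{k-1}}} \to \phi_k \omega V_\theta$

are productions already in $G_R$, $\phi_1,\ldots,\phi_k$ are strings of terminals of $G_R$, $\phi_1\phi_2\cdots\phi_k
= \beta_i$, $\omega$ is a nonempty string of terminals of $G_R$, $V_{\beta_i}$ and $V_{\beta_{i_1}},\ldots,V_{\beta_{i_{k-1}}}$ are
nonterminals of $G_R$, and $V_\theta$ is a nonterminal of $G_R$ or is $\epsilon$. Notice that $\phi_1,\ldots,\phi_k$
are nonempty since every right-linear production has a nonempty
terminal string.$^{17}$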

(1) Termination: To see that the algorithm halts with a grammar $G_R$ only polynomially larger
than $G_P$, observe that the terminal string in each new production is a suffix of some $\alpha_i$ or $\gamma_i$
in $G_P$, and no new non-terminal symbols are introduced. Let $\alpha$ be the longest of $\alpha_1,\ldots,\alpha_n$,
let $\beta$ be the longest of $\beta_1,\ldots,\beta_m$, and let $\gamma$ be the longest of $\gamma_1,\ldots,\gamma_m$. It follows that at most
$n|\alpha| + m|\gamma|$ such suffixes exist. Thus there are only $m(m+1)(n|\alpha| + m|\gamma|)$ productions$^{18}$ added
to $G_R$, although the algorithm may take exponential time to construct this grammar. (We
are interested in the algorithm only to establish an equivalence between the languages and to
describe the sizes of their representations; thus it need not be efficient.)

(2) $L(G_P) = L(G_R)$: The intuition behind the algorithm is helpful in understanding why
$L(G_R) = L(G_P)$. In presenting the intuition, and in the proof, we give the nonterminal $S$ of
$G_R$ the alternative name "$V_\epsilon$". Note that this does not create a naming conflict since $\epsilon$ never
appears as the left-hand side of a production of $G_P$. The strings $\alpha_i$, $1 \le i \le n$, in $G_P$ each
assert, "$\alpha_i$ is a string in $L(G_P)$." The productions $\beta_j \to \gamma_j$, $1 \le j \le m$, in $G_P$ assert, "if $\beta_j\omega$
is a string in $L(G_P)$ for some string $\omega$, then $\gamma_j\omega$ is also in $L(G_P)$," or less formally, "anything
that can follow $\beta_j$ can also follow $\gamma_j$." The right-linear grammar $G_R$ simply makes these same
assertions in a different way. We want a nonterminal $V_\beta$ to represent the set of all strings that
can follow $\beta$ in $L(G_P)$; that is, we want the set of strings that $G_R$ can generate from $V_\beta$ to be
the set of all strings $\omega$ such that $\beta\omega \in L(G_P)$. As a special case, notice that we want $V_\epsilon$, or
$S$, to represent the set of all strings that can follow $\epsilon$ in $L(G_P)$, that is, the set of all strings
in $L(G_P)$. This shows $L(G_R) = L(G_P)$. To achieve this result, we want a production $V_\beta \to \gamma$
in $G_R$ to assert, "$\gamma$ can follow $\beta$ in $L(G_P)$," and we want a production $V_{\beta_1} \to \gamma V_{\beta_2}$ to assert,
"$\gamma$, followed by any string that can follow $\beta_2$ in $L(G_P)$, can follow $\beta_1$ in $L(G_P)$." The following
lemma says this is indeed what the productions of $G_R$ assert. The lemma is then used to prove
that $L(G_R) = L(G_P)$ by showing that the productions force the nonterminals to represent the
appropriate sets of strings.

17 We use the phrase terminal string to mean the entire string of terminals on the right-hand side of a
production.

18 It can be shown that the maximum size of any such production is bounded by a polynomial in the
size of $G_P$.

Lemma 85 At any time during the construction of $G_R$, (1) if $V_\beta \to \gamma$ is added to $G_R$ then we
have $\beta\gamma \in L(G_P)$, and (2) if $V_\beta \to \gamma V_{\beta'}$ is added to $G_R$ then for any $\beta'\omega \in L(G_P)$ we have
$\beta\gamma\omega \in L(G_P)$.

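Proof: All the $V_\epsilon$-productions are added first. The result clearly holds after each of these is
added: a production $V_\epsilon \to \gamma$ is added only if $\gamma$ is a base string of $G_P$, so $\gamma = \epsilon\gamma \in L(G_P)$,
and a production $V_\epsilon \to \gamma V_{\beta'}$ is added only if $\beta' \to \gamma$ is a production of $G_P$, so if $\beta'\omega \in L(G_P)$
then applying $\beta' \to \gamma$ yields $\gamma\omega = \epsilon\gamma\omega \in L(G_P)$. Assume all productions added up to some
arbitrary other point in the construction of $G_R$ have the specified properties. If the algorithm
now adds a production $V_\beta \to \gamma$, $G_R$ must already contain productions

$V_\epsilon \to \phi_1 V_{\beta_{i_1}},\ V_{\beta_{i_1}} \to \phi_2 V_{\beta_{i_2}},\ \ldots,\ V_{\beta_{i_{k-1}}} \to \phi_k \gamma$

where $\phi_1\phi_2\cdots\phi_k = \beta$. So $\beta_{i_{k-1}}\phi_k\gamma \in L(G_P)$ and $\beta_{i_{k-2}}\phi_{k-1}\phi_k\gamma \in L(G_P)$ and ... and $\epsilon\phi_1\phi_2\cdots\phi_k\gamma =
\beta\gamma \in L(G_P)$.

Similarly, if the algorithm now adds a production $V_\beta \to \gamma V_{\beta'}$, then $G_R$ must already contain
productions

$V_\epsilon \to \phi_1 V_{\beta_{i_1}},\ V_{\beta_{i_1}} \to \phi_2 V_{\beta_{i_2}},\ \ldots,\ V_{\beta_{i_{k-1}}} \to \phi_k \gamma V_{\beta'}$

where $\phi_1\phi_2\cdots\phi_k = \beta$. Then for any $\beta'\omega \in L(G_P)$, it is the case that $\beta_{i_{k-1}}\phi_k\gamma\omega \in L(G_P)$ and
$\beta_{i_{k-2}}\phi_{k-1}\phi_k\gamma\omega \in L(G_P)$ and ... and $\epsilon\phi_1\phi_2\cdots\phi_k\gamma\omega = \beta\gamma\omega \in L(G_P)$. □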

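(2.1) $L(G_P) \supseteq L(G_R)$:

Proof: $L(G_P) \supseteq L(G_R)$ may be restated as follows: if $V_\epsilon \Rightarrow^* \gamma$ is a derivation of $G_R$ then
$\epsilon\gamma = \gamma \in L(G_P)$. We prove, more generally, that for any nonterminal $V_\beta$ of $G_R$, if $V_\beta \Rightarrow^* \gamma$
is a derivation of $G_R$ then $\beta\gamma \in L(G_P)$, and if $V_\beta \Rightarrow^* \gamma V_{\beta'}$ is a derivation of $G_R$ then for any
$\beta'\omega \in L(G_P)$ we have $\beta\gamma\omega \in L(G_P)$. The proof is by induction on the length of the derivation.
If the length is 1, then $V_\beta \to \gamma$ ($V_\beta \to \gamma V_{\beta'}$) is a production of $G_R$, and the result therefore
holds by Lemma 85. If the length is greater than 1, let $V_\beta \to \gamma_1 V_{\beta''}$ be the first production
applied in the derivation. By Lemma 85, for any $\beta''\omega \in L(G_P)$ we have $\beta\gamma_1\omega \in L(G_P)$. Then
the remainder of the derivation must be $\gamma_1 V_{\beta''} \Rightarrow^* \gamma_1\gamma_2$ (or $\gamma_1 V_{\beta''} \Rightarrow^* \gamma_1\gamma_2 V_{\beta'}$) where $\gamma_1\gamma_2 = \gamma$.
This derivation implies the existence of a derivation $V_{\beta''} \Rightarrow^* \gamma_2$ (or $V_{\beta''} \Rightarrow^* \gamma_2 V_{\beta'}$) of the same
length. By the inductive hypothesis, $V_{\beta''} \Rightarrow^* \gamma_2$ implies $\beta''\gamma_2 \in L(G_P)$ (and $V_{\beta''} \Rightarrow^* \gamma_2 V_{\beta'}$ implies
that for any $\beta'\omega' \in L(G_P)$ we have $\beta''\gamma_2\omega' \in L(G_P)$). Therefore, because if $\beta''\omega \in L(G_P)$ then
$\beta\gamma_1\omega \in L(G_P)$, letting $\omega$ be $\gamma_2$ (or $\gamma_2\omega'$) we have $\beta\gamma_1\gamma_2 = \beta\gamma \in L(G_P)$ (or $\beta\gamma_1\gamma_2\omega' = \beta\gamma\omega' \in
L(G_P)$). □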

(2.2) $L(G_P) \subseteq L(G_R)$:

Proof: We begin by proving that for any nonterminal $V_\beta$ other than $V_\epsilon$, if $G_R$ generates $\gamma V_\beta$
and $G_P$ generates $\beta\omega$ then $G_R$ generates $\gamma\omega$. The proof is by induction on the length of the
derivation of $\gamma V_\beta$ in $G_R$. If the length of the derivation is 1, $G_R$ must contain a production
$V_\epsilon \to \gamma V_\beta$. Then $\beta \to \gamma$ must be a production of $G_P$. Therefore, since $G_P$ generates $\beta\omega$ and
has a production $\beta \to \gamma$, it generates $\gamma\omega$ as well. If the length of the derivation of $\gamma V_\beta$ in $G_R$
is greater than 1 then, because all productions in $G_R$ are right-linear, the last sentential form
generated before $\gamma V_\beta$ in a given derivation is some $\gamma_1 V_{\beta'}$, where $\gamma = \gamma_1\gamma_2$, $\gamma_2$ is nonempty, and
$V_{\beta'} \to \gamma_2 V_\beta$ is a production of $G_R$. By Lemma 85 we know from $V_{\beta'} \to \gamma_2 V_\beta$ and $\beta\omega \in L(G_P)$
that $\beta'\gamma_2\omega \in L(G_P)$. By the inductive hypothesis, since $G_R$ generates $\gamma_1 V_{\beta'}$ (with a shorter
derivation than for $\gamma V_\beta$) and $\beta'\gamma_2\omega \in L(G_P)$, we know $G_R$ generates $\gamma_1\gamma_2\omega = \gamma\omega$.

To prove the complete result, let $\omega$ be any string that $G_P$ generates. If the derivation of $\omega$
has length 1, that is, if $\omega$ is itself a base string of $G_P$, then $G_R$ has a production $V_\epsilon \to \omega$, so
clearly $G_R$ generates $\omega$. Otherwise, the last production applied in the derivation of $\omega$ is some
$\beta \to \gamma$ that maps $\beta\gamma'$ to $\gamma\gamma'$ where $\omega = \gamma\gamma'$. Because $G_P$ has the production $\beta \to \gamma$, $G_R$ has the
production $V_\epsilon \to \gamma V_\beta$. Since $G_R$ generates $\gamma V_\beta$ and $G_P$ generates $\beta\gamma'$, by our previous result
we know that $G_R$ generates $\gamma\gamma' = \omega$. □

6.7 The Prefix Languages Contain the Regular Languages 

Input: DFA $M = (Q, q_0, \Sigma, F, \delta)$
Output: prefix grammar $G$

1. For each $f \in F$

(a) For each string $\gamma$ accepted by a cycle-free transition sequence from $q_0$ to $f$

i. Add $\gamma$ to the base set $S$ of $G$

2. For each $q \in Q$

(a) For each pair of strings $(\alpha, \beta)$ each of which induces a cycle-free transition sequence
from $q_0$ to $q$

i. Add the production $\alpha \to \beta$ to $G$

ii. Add the production $\beta \to \alpha$ to $G$

(b) For each pair of strings $(\alpha, \beta)$ where $\alpha$ induces a cycle-free transition sequence from
$q_0$ to $q$ and $\beta$ induces a cycle from $q$ to $q$ which does not properly contain any cycles

i. Add the production $\alpha \to \alpha\beta$ to $G$
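
The two steps above can be sketched directly. The sketch below is illustrative only and assumes an encoding that is not in the source: the transition function is a dictionary from (state, symbol) pairs to states, symbols are single characters, and all function names are hypothetical. A bounded enumerator and a DFA simulator are included so the construction can be compared with the DFA on short strings.

```python
from collections import deque
from itertools import product


def simple_paths(delta, start):
    """Map each reachable state q to the strings that drive the DFA from
    start to q without visiting any state twice."""
    paths = {}

    def walk(state, word, visited):
        paths.setdefault(state, []).append(word)
        for (src, symbol), dst in delta.items():
            if src == state and dst not in visited:
                walk(dst, word + symbol, visited | {dst})

    walk(start, "", {start})
    return paths


def simple_cycles(delta, q):
    """Strings leading from q back to q with no repeated intermediate state."""
    cycles = []

    def walk(state, word, visited):
        for (src, symbol), dst in delta.items():
            if src != state:
                continue
            if dst == q:
                cycles.append(word + symbol)
            elif dst not in visited:
                walk(dst, word + symbol, visited | {dst})

    walk(q, "", {q})
    return cycles


def dfa_to_prefix_grammar(delta, start, accepting):
    """Steps 1 and 2 of the construction: returns (base set, productions)."""
    paths = simple_paths(delta, start)
    base = {w for f in accepting for w in paths.get(f, [])}
    productions = set()
    for q, words in paths.items():
        for a, b in product(words, repeat=2):
            if a != b:
                productions.add((a, b))  # the ordered pairs cover a -> b and b -> a
        for a in words:
            for cycle in simple_cycles(delta, q):
                productions.add((a, a + cycle))
    return base, productions


def generate(base, productions, max_len):
    """Strings derivable from the base set without exceeding max_len symbols."""
    seen = {s for s in base if len(s) <= max_len}
    queue = deque(seen)
    while queue:
        s = queue.popleft()
        for alpha, beta in productions:
            if s.startswith(alpha):
                t = beta + s[len(alpha):]
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen


def accepts(delta, start, accepting, word):
    """Run the DFA on word; a missing transition rejects."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in accepting
```

For a DFA over {a, b} that accepts strings with an even number of a's, encoded as delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1} with start state 0 and accepting set {0}, the construction yields a base set containing only the empty string and four productions, each of which appends a simple cycle to a cycle-free prefix.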

Claim 86 The algorithm produces a prefix grammar that defines exactly the same language as
the given DFA.
Proof: First observe that at each step the algorithm looks for cycle-free transition sequences.
Such sequences are bounded in length by $|Q|$, and hence there are at most $|Q|!$ such sequences.
Thus the algorithm terminates.

Next observe that for DFAs, a given string and an initial state uniquely determine the state
after applying the string to the initial state. Now, every production in $G$ replaces a prefix of a
sentential form with another prefix, and the replacement prefixes are chosen from among those
strings which leave the DFA in precisely the same state as the replaced prefix. Thus every
production of the grammar is guaranteed to preserve acceptance. Finally, the only sentential
forms explicitly given for $L(G)$ are strings accepted by the DFA. Thus $L(G) \subseteq L(M)$.

What now remains is to show that $L(M) \subseteq L(G)$. This portion of the proof is accomplished
by strong induction on the number of state revisitations$^{19}$ by a string accepted by the DFA.
Suppose $w$ is accepted by $M$ and no state revisitations occur. In this case, this string was
explicitly placed in the grammar $G$ in step 1 of the algorithm; thus $w \in L(G)$. Now suppose
inductively that all strings accepted by $M$ causing at most $k$ state revisitations can be generated
by $G$, and consider any string $w$ accepted by $M$ which causes $k + 1$ state revisitations.
Necessarily, $w$ must cause $M$ to revisit some state within its first $|Q|$ symbols. Furthermore,
without loss of generality, we may assume that this revisitation is a simple cycle. Consider
the first such revisit. Name the substring that causes this revisitation $w_m$ so that $w$ may be
rewritten as $w_p w_m w_s$ (where $w_p$ is the prefix of $w$, $w_m$ is the middle of $w$, and $w_s$ is the suffix
of $w$). We can remove $w_m$ from $w$ and the resulting string $w_p w_s$ is accepted by $M$ and has at
most $k$ state revisitations, so inductively, $w_p w_s$ can be generated by $G$. Now, by our choice of
$w_m$, $w_p$ induces a cycle-free transition from $q_0$ to some state $q$ and $w_m$ induces a cycle from
$q$ to $q$ which does not itself properly contain any cycles. But this implies that step 2 of the
algorithm added the production $w_p \to w_p w_m$ to $G$. Thus we can apply this production to the
string $w_p w_s$ to obtain $w_p w_m w_s = w$. Therefore $w \in L(G)$, which completes the induction and
the proof. □

This completes the proof that the regular languages are contained in the prefix languages. 

19 A state revisitation occurs whenever any state that has already been visited is visited again. Thus a string
which produces the state sequence $q_0, q_4, q_3, q_0, q_3$ causes 2 state revisitations: 1 revisitation of the state $q_0$ and
one revisitation of the state $q_3$.
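
The counting rule in this footnote can be written as a short function. This is a sketch only: the function name is hypothetical and states are represented as arbitrary hashable labels.

```python
def count_revisitations(state_sequence):
    """Count visits to states that were already visited earlier in the sequence."""
    seen = set()
    revisits = 0
    for state in state_sequence:
        if state in seen:
            revisits += 1
        seen.add(state)
    return revisits


# The sequence from the footnote: q0, q4, q3, q0, q3 causes 2 revisitations.
footnote_example = count_revisitations(["q0", "q4", "q3", "q0", "q3"])
```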




6.8 Applications to Learnability



It follows from Theorem 84 that $\mathcal{H}_{*,1}$ can be learned in exponential time in terms of DFAs with
equivalence queries and membership queries$^{20}$ using Angluin's [3] L* algorithm.$^{21}$ We should
emphasize that this result does not subsume the learnability result given earlier for $\mathcal{H}_{2,1}$ (it
clearly does not subsume the result for $\mathcal{H}_{2,k}$) for at least two reasons. First, the earlier result is
for learning in polynomial time. Second, the earlier result does not require membership queries.

An interesting open question is whether there exists a polynomial time transformation from
strings to strings such that for every prefix grammar there exists a DFA, rather than right-linear
grammar, only polynomially larger than the prefix grammar that accepts the transformation
of the language generated by the grammar.$^{22}$ If so, $\mathcal{H}_{*,1}$ can be learned in terms of DFAs, and
therefore predicted, with membership and equivalence queries. If, on the other hand, it can be
shown that there exists a polynomial size prefix grammar for any NFA, then predicting $\mathcal{H}_{*,1}$ is
as hard as predicting NFAs, which cannot be done (under certain cryptographic assumptions)



The concept classes studied in this chapter are incomparable to (that is, neither subsume nor
are subsumed by) other classes of definite clause theories whose learnability, in the PAC or exact
models, has been investigated [41, 82].$^{23}$ Page and Frisch [82] investigated classes of definite
clauses that may have predicates and functions of arbitrary arity but explicitly do not have
recursion. In that work, a background theory$^{24}$ was also allowed; allowing such a theory in the
present work is an interesting topic for future research. Cohen [35] investigates the learnability
of function-free, two clause, closed, linearly recursive, $ij$-determinate logic programs, given a
background theory. His positive result appeared at the same time as this chapter [43], and



20 A membership query asks whether the target concept classifies a particular example as positive or
negative.

21 We say in this case that $\mathcal{H}_{*,1}$ can be predicted using membership and equivalence queries.

22 It can be shown that if we demand that the transformation be the identity transformation, i.e., that the
DFA accept exactly the set of strings generated by the grammar, then the answer to this question is "no".

23 Theoretical work related to the learnability of definite clauses in other models includes Shapiro's [95] work
on learning in the limit, Ling's [70] investigation of learning from good examples, and the work in Chapters 2
and 3 of this thesis on propositional Horn clause theories.

24 A background theory essentially represents what the learner already knows about the world.
both are distinctive in that they seem to be the first positive learnability results for recursive
theories for which the learner is not allowed to ask about the classification of specific examples. 
Cohen [34] gives a negative result when the linear recursiveness restriction is removed and the 
target consists of a single clause. 

Dzeroski, Muggleton, and Russell [41] investigated the learnability of classes of function-free
determinate $k$-clause definite clause theories under simple distributions, also in the presence of
a background theory.$^{25}$ This class includes recursive concepts; to learn recursive concepts, the
algorithm requires two additional kinds of queries (existential queries and membership queries).
Rewriting definite clause theories that contain functions to function-free clauses allows their
algorithm to learn in the presence of functions. Nevertheless, the restriction that clauses be
determinate effectively limits the depth of function nesting; their algorithm takes time exponential
in this depth. So, for example, while the algorithm can easily learn the concept even integer,
or multiple of 2, from $\mathcal{H}_{2,1}$, namely $[even(0)] \wedge [even(x) \to even(s(s(x)))]$, the time it requires grows
exponentially in moving to a concept such as multiple of 10 or multiple of 1000, also in $\mathcal{H}_{2,1}$.
It is easy to show that the classes $\mathcal{H}_{2,1}$, $\mathcal{H}_{2,k}$, $\mathcal{H}_{2,*}$, and $\mathcal{H}_{*,1}$, rewritten to be function-free, are
not $ij$-determinate, for any $i$ and $j$.

Finally, Dzeroski, Muggleton, and Russell [41] obtain their results by a transformation 
to a concept class in propositional logic, and they note the importance of finding learnable 
classes of first-order concepts that have no such transformation. While we have been unable to 
find any such transformation for the classes presented here, showing that a class has no such 
transformation appears to be a difficult task, and also one we have been unable to accomplish. 
In addition, Section 6 indicates that it may be important as well to determine whether a 
learnable class of first-order concepts has such a transformation to a class of formal languages 
or automata, particularly when recursion is present. That section also leads one to speculate 
about whether various classes of first-order concepts may even be interesting hybrids between 
concepts of propositional logic and formal languages or automata. The relationship of concept 
classes described in this chapter to other classes is also an interesting area for further research. 



25 The restriction to simple distributions is a small one that quite possibly can be removed.

Chapter 7



Description Logics 

Description logics, also called terminological logics, are commonly used in knowledge-based
systems to describe objects and their relationships. This chapter investigates the learnability
of a typical description logic, CLASSIC, and shows that Classic sentences are learnable in
polynomial time in the exact learning model using equivalence queries and membership queries
(which are, in essence, "subsumption queries"). It is shown that membership queries alone are
insufficient for polynomial time learning of CLASSIC sentences. Combined with earlier negative
results of Cohen and Hirsh [33] showing, given standard complexity-theoretic assumptions, that
random examples alone are insufficient in the PAC setting, this shows that both sources of
information are necessary for efficient learning in that neither type alone is sufficient. In
addition, it is shown that a modification of the algorithm deals robustly with persistent
malicious two-sided classification noise in the membership queries with the probability of a
misclassification polynomially bounded below 1/2. The fact that this first-order language can be
learned in full is surprising; even more surprising is the robust error tolerance.

7.1 Introduction

A central problem facing automated Information Systems is knowledge acquisition (e.g., [76]).
Whether dealing with expert systems, database constraints, or deductive databases, applying
knowledge demands that knowledge be at hand. Unfortunately, extracting knowledge from
a domain expert results in either an extremely narrow base of knowledge, or an enormous
amount of potentially buggy knowledge, or both. We address the problem of efficient knowledge
acquisition from the vantage point of computational learning theory.

Traditionally, computational learning theory has focused on propositional domains. We
investigate learning in the first-order domain of description logics or terminological logics.
Specifically, we consider the learnability of the description logic known as Classic [39]. To the extent
that Classic is a typical description logic, our results generalize to a variety of other such
logics.

Description logics are more expressive than the propositional calculus. A description logic
statement is essentially a first-order predicate calculus formula in which all but one variable is
quantified. Therefore, the meaning of a statement in a description logic, instead of being either
true or false for a given interpretation, is the subset of the universe satisfying the statement.
For example, suppose that the universe is a set of dogs, brown(x) asserts that x is brown, and
smaller(x,y) asserts that y is smaller than x. If it happens to be the case that Rex is the
only shaggy dog and Fido is the only brown dog, then $brown(x) \wedge \forall y\, smaller(x,y)$ is a
well-formed description logic statement denoting the set {Fido} provided Fido is the largest dog in
the universe; otherwise the empty set is denoted. Likewise $brown(x) \wedge shaggy(x)$ denotes the
empty set. Neither statement is a closed formula in the predicate calculus and neither statement
has an associated truth value. Thus, description logics have a different flavor than the predicate
calculus. Description logics comprise "natural" classes of formulas in that description logics are
used in the field of knowledge representation [17, 28, 25, 26, 39, 68, 77, 83].

7.1.1 Classic 

Essentially, Classic permits constructing certain quantified descriptions which distinguish a
particular subset of a domain $\mathcal{I}$ of individuals. Classic descriptions contain primitive symbols
which get mapped to arbitrary subsets of $\mathcal{I}$, disjoint primitive symbols which get mapped to
mutually disjoint subsets of $\mathcal{I}$, roles which get mapped to binary relations on $\mathcal{I}$, and attributes
which are roles that happen to be functions. Further, Classic sentences contain constructors
which manipulate these primitives, roles, and attributes, in order to permit the denotation
of complicated subsets of $\mathcal{I}$. The following synopsis and semantics of Classic is excerpted
from [26, 36, 33].

(SAME-AS $(r_{1,1} \ldots r_{1,l_1})$ $(r_{2,1} \ldots r_{2,l_2})$) denotes the set
of individuals for which composing the first chain of attributes is the same as composing
the second chain of attributes.

(ALL r D) denotes the set $\{x : \forall y\, [r(x,y) \rightarrow D(y)]\}$ of individuals for which all of the
r-related individuals satisfy description D.

(AND $D_1 \ldots D_n$) denotes the set $\{x : D_1(x) \wedge \cdots \wedge D_n(x)\}$ of individuals that satisfy all of
the descriptions $D_1, \ldots, D_n$.

(AT-LEAST n r) denotes the set $\{x : |\{y : r(x,y)\}| \ge n\}$ of individuals having at least n
r-related individuals.

(AT-MOST n r) denotes the set $\{x : |\{y : r(x,y)\}| \le n\}$ of individuals having at most n
r-related individuals.

(PRIM $p_i$) denotes the subset of individuals denoted by the primitive symbol $p_i$ (provided by
the interpretation).¹

(FILLS r $p_1 \ldots p_n$) denotes the set $\{x : \exists y_i \in p_i \text{ such that } r(x,y_1) \wedge \cdots \wedge r(x,y_n)\}$, where the
$p_i$ are disjoint primitive symbols.

(ONE-OF $p_1 \ldots p_n$) denotes the set $\bigcup_{i=1}^{n} p_i$, where the $p_i$ are disjoint primitive symbols.

Descriptions are built from the individuals, primitives, and other descriptions. For example,
if our set of individuals is the set of all dogs and breeds² and we have at our disposal the
primitive concept brown for the set of brown dogs, the role smaller for comparing dog sizes,
the attribute breed for denoting breeds, the attribute father for associating a dog with its

1 In our illustrations we omit this formalism and use descriptions such as brown to denote the primitive set of
things which are brown.

2 Keep in mind that these two different "types" of individuals are indistinguishable within a description logic
statement; it is the venue of the externally supplied meanings of the roles and primitives to preserve any intuitive
distinctions we may have concerning these different "types" of individuals.

father, the attribute mother for associating a dog with its mother, and the role school denoting
the obedience school classmates, then if we wished to denote the set

$$\{x : \forall y\; school(x,y) \rightarrow [\, brown(y) \;\wedge\; |\{z : smaller(y,z)\}| \ge 20 \;\wedge\; breed(mother(y)) = breed(father(father(y))) \,]\}$$

of dogs all of whose obedience school classmates were brown, larger than at least twenty other 
dogs, and had mother and paternal grandfather of the same breed, we would write: 

(ALL school (AND brown 

(AT-LEAST 20 smaller) 

(SAME-AS (mother breed) (father father breed))))

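
To make the semantics of the constructors concrete, the following is a minimal sketch of an evaluator for a fragment of Classic over a small finite interpretation. It is illustrative only: the tuple encoding of descriptions, the function names (`fillers`, `follow`, `satisfies`, `denotation`), and the toy interpretation are all assumptions introduced here, not part of Classic. Attributes are treated as possibly partial functions, and SAME-AS is taken to fail when either chain is undefined. The threshold 20 is lowered to 1 so that the toy universe can satisfy it.

```python
def fillers(roles, r, x):
    """Return {y : r(x, y)}, the r-related individuals of x."""
    return {y for (a, y) in roles.get(r, set()) if a == x}


def follow(roles, chain, x):
    """Compose a chain of attributes starting at x; None if a step is undefined."""
    for attribute in chain:
        targets = fillers(roles, attribute, x)
        if len(targets) != 1:  # assumption: attributes may be partial in a toy model
            return None
        (x,) = targets
    return x


def satisfies(desc, x, prims, roles):
    """Decide whether individual x is in the denotation of description desc."""
    op = desc[0]
    if op == "PRIM":
        return x in prims[desc[1]]
    if op == "AND":
        return all(satisfies(d, x, prims, roles) for d in desc[1:])
    if op == "ALL":
        _, r, d = desc
        return all(satisfies(d, y, prims, roles) for y in fillers(roles, r, x))
    if op == "AT-LEAST":
        _, n, r = desc
        return len(fillers(roles, r, x)) >= n
    if op == "AT-MOST":
        _, n, r = desc
        return len(fillers(roles, r, x)) <= n
    if op == "SAME-AS":
        _, chain1, chain2 = desc
        a, b = follow(roles, chain1, x), follow(roles, chain2, x)
        return a is not None and a == b
    raise ValueError(f"unsupported constructor: {op}")


def denotation(desc, universe, prims, roles):
    """The subset of the universe selected by desc under this interpretation."""
    return {x for x in universe if satisfies(desc, x, prims, roles)}


# Hypothetical interpretation: dogs and one breed, with roles given as sets of pairs.
universe = {"rex", "fido", "spot", "ma", "pa", "gpa", "collie"}
prims = {"brown": {"fido"}}
roles = {
    "school": {("rex", "fido"), ("spot", "rex")},
    "smaller": {("fido", "spot")},
    "mother": {("fido", "ma")},
    "father": {("fido", "pa"), ("pa", "gpa")},
    "breed": {("ma", "collie"), ("gpa", "collie")},
}
example = ("ALL", "school",
           ("AND", ("PRIM", "brown"),
                   ("AT-LEAST", 1, "smaller"),
                   ("SAME-AS", ("mother", "breed"), ("father", "father", "breed"))))
result = denotation(example, universe, prims, roles)
```

In this toy interpretation, individuals with no classmates satisfy the ALL constraint vacuously, so only an individual with a classmate that violates the inner description is excluded.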
7.1.2 The Learning Problem 

The meaning of a description logic statement depends on a particular interpretation. It is a
set selector: Given a choice of a universal set of individuals $\mathcal{I}$, an assignment of the primitive
symbols (such as brown) to subsets of $\mathcal{I}$, an assignment of the roles to binary relations on $\mathcal{I}$,
and an assignment of attributes to functions from $\mathcal{I}$ to $\mathcal{I}$, the statement denotes the set of
elements x in $\mathcal{I}$ that cause the corresponding first-order expression to evaluate to true, given
the semantics above.

One way to define a positive example of a Classic sentence is as an individual that is
in the denotation of the sentence. One drawback of this definition is that the classification
of an example depends on a particular interpretation, and not just on the target concept.
Further, the classification may depend on the relationships among a potentially infinite number
of individuals from the universe. An alternative definition, taken by Cohen and Hirsh [33], and
supported in the description logic community [25, 24, 40], is to define a positive example to be
the description of an individual (using the same description logic) that is a positive example for
every possible interpretation. Because such a description may not select a unique individual for
all interpretations, each such (positive) example is actually a concept itself, whose denotation is
a subset of the denotation of the target concept. Thus, $C$ is a positive example if $C_*$ subsumes $C$.
This viewpoint is also supported by previous work in inductive logic programming [82, 41, 43]
and learning from "entailment" [46, 2], where positive examples of an unknown (typically
first-order) formula are clauses or other formulas that are entailed by the unknown formula.

We employ the standard protocol of learning from equivalence and membership queries. Our
algorithm may conjecture any CLASSIC description $H$, and is told whether or not $H$ is equivalent
to the target description $C_*$ (i.e., has the same denotation for all possible interpretations). If
$H$ is not equivalent to $C_*$, then the algorithm receives a counterexample, which is a positive
example of one of $H$ and $C_*$, but not both. The algorithm also may present a description $C$,
and is then told whether or not $C$ is a positive example of $C_*$ (i.e., whether or not $C_*$ subsumes
$C$). Of course, a standard transformation turns this algorithm into a PAC learning algorithm
with membership queries.

The algorithm takes advantage of the fact that such queries C may be arbitrary concepts, 
so perhaps it is more appropriate to call these "subsumption" queries, or even "subset" queries. 
The distinction between these notions is lost given the common use of the "single representation 
trick" in AI - where it is often convenient and desirable to represent concepts and examples using 
the same language. In fact, Cohen and Hirsh [33] note that in many implemented description 
logics, "it is possible to attach an arbitrary description to an instance [example], hence the 
distinction between instances [examples] and concepts is blurred." 

Because the subsumption relation is computable in polynomial time [36, 33], both membership
and equivalence queries are efficiently computable by a teacher. The main results of the
chapter are now summarized:

Theorem Classic is learnable in time polynomial in the size of the target description and the 
length of the longest counterexample, using membership (subsumption) and equivalence 
queries. 

Theorem Any algorithm using membership (subsumption) queries alone requires a number of
membership queries that is exponential in the size of the target concept.³

3 Thus the positive result does not come solely from the membership queries. Cohen and Hirsh [33] showed
that CLASSIC is not learnable in polynomial time (without membership queries) in the PAC model (assuming
RP ≠ NP), hence CLASSIC cannot be learned from equivalence queries alone given the same assumption. Thus,
neither membership nor equivalence queries are dispensable - they form a minimal set of learning queries for
Classic.

Theorem Classic remains learnable in the exact learning model with equivalence and membership
queries, even when each membership query may be answered incorrectly by a malicious adversary
with probability 1/2 - 1/r, where r is any polynomial function of the size of the target concept.⁴

In particular, when a membership query is asked, an adversary flips coins, and with
probability 1/2 - 1/r, may choose to classify incorrectly. (Any future queries on the same
example result in the same classification.) We present a modification of our algorithm
that will work when r is any polynomial function of the size of the target concept. To
our knowledge, this is the first algorithm for any concept class capable of coping robustly
with such errors.
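
The noise model can be illustrated with a small simulation. The sketch below is an assumption-laden stand-in, not the construction analyzed in this chapter: the class name `PersistentNoisyOracle` and its parameters are hypothetical, examples are assumed to be hashable, and the adversary is simplified to one that lies whenever its coin permits. Persistence is modeled by caching the first answer given for each example.

```python
import random


class PersistentNoisyOracle:
    """Membership oracle whose answers may be wrong with probability 1/2 - 1/r.

    Each example is answered once and the answer is cached, so repeating a
    query never yields new information (the errors are persistent).
    """

    def __init__(self, truth, r, seed=0):
        if r < 2:
            raise ValueError("r must be at least 2 so that 1/2 - 1/r is non-negative")
        self.truth = truth                      # function: example -> bool
        self.flip_probability = 0.5 - 1.0 / r   # chance the adversary may lie
        self.rng = random.Random(seed)          # seeded for reproducibility
        self.answers = {}                       # cache of answers already given

    def member(self, example):
        if example not in self.answers:
            correct = self.truth(example)
            # Simplifying assumption: the adversary lies whenever allowed to.
            lie = self.rng.random() < self.flip_probability
            self.answers[example] = (not correct) if lie else correct
        return self.answers[example]
```

With r = 2 the flip probability is zero and the oracle is noise-free; as r grows, the error rate approaches 1/2, which is the regime the theorem addresses.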

7.1.3 Comparison to Previous Results 

Automating propositional explanation discovery has been well studied [7]. In comparison,
efficient first-order learnability has been less well studied. Even so, some results are known.
Cohen [35] gives a PAC learning algorithm for function-free, two clause, closed, linearly recursive,
ij-determinate logic programs; he goes on to show [34] that when the condition of linear
recursiveness is relaxed, the learning problem becomes cryptographically hard. Page and Frisch [82]
show that constrained atoms (a typed logic) are efficiently learnable, Frazier and Page [43]
provide a learning algorithm for a syntactically restricted subclass of first-order Horn formulas,
and Dzeroski et al. [41] provide a learning algorithm for a different restriction of first-order
Horn formulas.

Haussler [57] investigated existentially quantified conjunctive concepts and describes a graph 
representation for those concepts. He showed that learning some very simple scene descriptions 
is difficult. Specifically, he showed that even restricted to unary atoms such concepts are not 
learnable from random examples unless RP = NP, but did give a learning algorithm for settings
where the algorithm may use a richer vocabulary than that from which the target was chosen. 
Indeed, positive first-order learning results appear to be quite rare for "natural" classes of 
first-order formulas. It would seem that the difficulty of the learning task he faced revolved 
around the ambiguity admitted by the graphical representation required to capture existential 

4 The errors are persistent, so that the algorithm may not benefit from repeatedly asking the same question.

quantification in the concept class he investigated; our concept class does not permit existential
quantification. It will be seen that the graphs we use suffer no such ambiguity, thus we are able
to avoid the difficulty he faced.

The work most closely related is that of Cohen and Hirsh [33] who employ a graphical 
representation for Classic concepts developed by Borgida and Patel-Schneider [26]. To explain 
their results, and present ours, we briefly explain the notion of a labeled equivalence graph (called 
a concept graph in [26, 36, 33]).

Figure 7.1: A labeled equivalence graph.



Consider the graph in figure 7.1. This is a graphical depiction of the Classic description of 
the set of individuals who have at least one brother and at most two sisters, whose best friend 
has brown hair, who are their best friend's attorney, and whose best friend only has brown, 
shaggy dogs that have at least twenty puppies. The cycle in the graph also asserts infinitely 
many other SAME-AS conditions - for example, conditions about the best friend's attorney's
best friend. 

Formally defined in Section 7.3, a labeled equivalence graph is a rooted, directed, vertex- and
edge-labeled graph. Further, no vertex has two identically labeled outgoing edges. The edge 
labels represent binary relations over the universe of individuals, and an edge demands that all 
individuals in the image of the relation satisfy the constraints asserted by the vertex to which 
the edge leads. Also, any pair of disjoint directed paths between a pair of vertices involve 
only binary relations which are in fact functions. This pair asserts that the individual selected 
along one path must be the same as the individual selected along the other path. We associate 
a subset of the set of individuals $\mathcal{I}$ with each vertex in the graph: the subset of individuals
satisfying every constraint (whether vertex label, edge label, or assertion of equivalence at
some reachable vertex) asserted by the graph. The set of individuals denoted by the graph
is exactly the set of individuals associated with the root. Note that the presence of directed
cycles in the graph, and in particular, those involving the root, implies that the root concept is
being defined in terms of other concepts, . . . , which are in turn being defined in terms of the
root. Thus, cycles allow co-referential definitions.

Polynomial time algorithms exist for transforming labeled equivalence graphs into Classic
sentences, and vice versa [26, 36, 33], although not all labeled equivalence graphs correspond
to valid Classic sentences. Thus, the question of learnability of Classic sentences more or
less reduces to that of learning such equivalence graphs. What would a positive example of
an equivalence graph look like? It is another graph which satisfies all of the constraints (and
perhaps more) represented by the first. The subsumption algorithm for Classic essentially
verifies that the vertex label reached by a path in the first graph is less restrictive than the
label of the corresponding vertex (which must exist) in the second graph, and that if in the
first graph the two paths labeled by strings $w_1$ and $w_2$ lead to the same vertex, then this occurs
in the second graph as well. For example, if we add an edge or vertex label to the graph in
Figure 7.1 we obtain a positive example. Conversely, deleting an edge or vertex label from
Figure 7.1 produces a negative example. The hard part of the learning problem is to determine
the structure of the graph and the edge labels, not to determine the vertex labels. Thus,
most of the constructors from the Classic language are not problematic; the main challenge
is presented by the SAME-AS conditions (each represented by a disjoint pair of paths between
two vertices), and the roles and attributes (which are edge labels). Henceforth we assume that
all the vertex labels are irrelevant; in Section 7.4 we show how the algorithm is modified when
this is not the case.



Figure 7.2: The universally positive example for equivalence graphs. $\Sigma$ indicates that every
possible edge label $\sigma \in \Sigma$ appears on a directed edge from the root vertex back to itself.

A natural first attempt to learn such graphs would be to simply intersect the graphs which
represent positive examples, thereby extracting the set of common vertices and edges. However,
since the universal positive example (Figure 7.2) contains every path (and thus is subsumed by
any target), simply intersecting the graphs will not be enough. What does succeed is a variant
of the "cross-product of DFAs" construction for regular language intersection. By repeatedly
taking the cross-product of positive examples, a (one-sided) learning algorithm is obtained. A
problem with this approach is that the cross-product of two equivalence graphs can be as large
as the product of their sizes; the repeated cross-product necessary to implement this approach
may yield exponentially sized hypotheses.

Cohen and Hirsh [33] circumvent this problem by restricting the number of distinct paths
through the graphical representation of a Classic concept. Given a constant k they consider
graphs $G$ having at most $|G|^k$ distinct paths (hence their graphs are acyclic). Denote this class
k-Classic. They show that the intersection approach above yields an $O(m^{k+1})$ mistake-bounded
one-sided learning algorithm for k-Classic, assuming all counterexamples have size at most m.
Negatively, they show that in the PAC learning model, assuming that RP ≠ NP, Classic is not
learnable from random examples alone, even if either of the following constraints hold: (i) the
primitive class alphabet is singleton, the role alphabet is doubleton, and the equivalence graph
of every example is acyclic, or (ii) the primitive class alphabet is singleton, and the equivalence
graph of every example contains only two vertices.

7.2 The Algorithm 

Simply stated, our algorithm employs the one-sided approach of graph cross-product discussed
above, but uses membership queries to bound the intermediate hypothesis size. Figures 7.3
and 7.4 give the learning algorithm. The cross-product $G \times H$ of labeled equivalence graphs $G$
and $H$ is described in the next section, as is the argument that the algorithm efficiently learns.
At first glance, equivalence graphs seem DFA-like, but their semantics are quite different, so
well-known DFA learning algorithms [3, 89] do not apply.

Learn

    Let H be the universally positive example
    H := Prune(H)
    While EQUIVALENT(H) provides counterexample G
        G := Prune(G)
        H := Prune(G × H)
    Return H

Figure 7.3: Equivalence graphs learning algorithm.

Prune(G)

    /* G is a positive example. */
    For each edge e in G
        If MEMBER(G \ e) is "yes", then remove e from G
            /* Along with e remove any unreachable component. */
    Return G

Figure 7.4: Algorithm using membership queries to remove excess graph elements from a
positive example.
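
The two figures can be rendered as a short executable sketch. Everything about the encoding below is an assumption made for illustration: a graph is a pair `(root, edges)` where `edges[v]` maps an edge label to a successor vertex, vertex labels and the empty graph are ignored, and the teacher is supplied as two callbacks, `equivalent(h)` (returning a counterexample or `None`) and `member(g)`. The helper names are hypothetical. The `subsumes` and `cross` functions follow the definitions given in Section 7.3.

```python
from collections import deque


def subsumes(g1, g2):
    """True if every g1-supported string is g2-supported and g1-equivalent
    strings are g2-equivalent (checked by mapping g1 vertices into g2)."""
    (r1, e1), (r2, e2) = g1, g2
    image, queue = {r1: r2}, deque([r1])
    while queue:
        u = queue.popleft()
        for label, v in e1[u].items():
            if label not in e2[image[u]]:
                return False              # a supported string is lost
            w = e2[image[u]][label]
            if v not in image:
                image[v] = w
                queue.append(v)
            elif image[v] != w:
                return False              # an equivalence is lost
    return True


def cross(g1, g2):
    """Cross-product restricted to vertex pairs reachable from the root pair."""
    (r1, e1), (r2, e2) = g1, g2
    root = (r1, r2)
    edges, stack = {root: {}}, [root]
    while stack:
        p, q = stack.pop()
        for label, p2 in e1[p].items():
            if label in e2[q]:
                succ = (p2, e2[q][label])
                edges[(p, q)][label] = succ
                if succ not in edges:
                    edges[succ] = {}
                    stack.append(succ)
    return root, edges


def without_edge(g, u, label):
    """Copy of g with edge (u, label) deleted and unreachable vertices dropped."""
    root, edges = g
    kept = {v: dict(out) for v, out in edges.items()}
    del kept[u][label]
    reachable, stack = {root}, [root]
    while stack:
        for v in kept[stack.pop()].values():
            if v not in reachable:
                reachable.add(v)
                stack.append(v)
    return root, {v: out for v, out in kept.items() if v in reachable}


def prune(g, member):
    """Figure 7.4: drop each edge whose removal keeps g a positive example."""
    for u, label in [(u, a) for u, out in g[1].items() for a in out]:
        if u in g[1] and label in g[1][u]:   # edge may already be gone
            candidate = without_edge(g, u, label)
            if member(candidate):
                g = candidate
    return g


def learn(alphabet, equivalent, member):
    """Figure 7.3: start from the universally positive example and refine."""
    h = prune((0, {0: {a: 0 for a in alphabet}}), member)
    g = equivalent(h)
    while g is not None:
        h = prune(cross(prune(g, member), h), member)
        g = equivalent(h)
    return h
```

A teacher for a known target can be simulated by taking `member = lambda g: subsumes(target, g)` and letting `equivalent` return any positive example of the target that the hypothesis does not subsume.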

7.3 Equivalence Graphs 

As discussed above, the learning problem for CLASSIC sentences is closely related to that of 
learning labeled equivalence graphs. We first consider equivalence graphs without vertex labels, 
and then indicate how the algorithm is modified to the more general case in Section 7.4. Later, 
in Section 7.5, we modify the algorithm again in order to learn Classic. 

Definition 87 Let $\Sigma$ be a finite alphabet. A rooted, directed, edge-labeled graph $G$ is an equivalence
graph over $\Sigma$ if each vertex $v$ in $G$ is reachable from the root and for every symbol $\sigma$ in
$\Sigma$, $v$ has at most one outgoing edge labeled $\sigma$. The size $|G|$ of an equivalence graph is the sum
of the number of edges and vertices in $G$.

A string $w$ of $\Sigma^*$ is $G$-supported if $w$ is the concatenation of symbols on the edges of a
rooted, directed path in $G$. $G$ defines an equivalence relation $\equiv_G$ on strings of $\Sigma^*$ as follows:
$w_1 \equiv_G w_2$ iff both $w_1$ and $w_2$ are $G$-supported, and their paths terminate at the same vertex.
Thus, a $G$-unsupported string is not $G$-equivalent to any other string. The set of all strings
$G$-equivalent to a string $w$ is denoted $[\![w]\!]_G$ and, by an abuse of notation, the set of all strings
that terminate at a vertex $v$ of $G$ is denoted $[\![v]\!]_G$. It is easily verified that for any equivalence
graph $G$, $\equiv_G$ is a right-invariant equivalence relation on strings, and that if $w$ is $G$-supported
then so is every prefix of $w$.

We define a partial order on equivalence graphs based on which strings are supported and 
which strings are equivalent: 

Definition 88 $G_1$ subsumes $G_2$ if every $G_1$-supported string is $G_2$-supported and every pair of
$G_1$-equivalent strings is $G_2$-equivalent.

Examples will be labeled according to their relationship to the target under this partial ordering.
Positive examples of an equivalence graph $G$ will be exactly those graphs $G'$ which are subsumed
by $G$.
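
Although Definition 88 quantifies over infinitely many strings when a graph has cycles, it can be decided by a single traversal. The sketch below is one possible check, under an assumed encoding (a graph is a pair `(root, edges)` with `edges[v]` a dictionary from labels to successor vertices; the empty graph is not represented). It maps each vertex of the first graph to the vertex of the second graph reached by the same strings, and fails as soon as a label is missing or two strings that meet in the first graph part ways in the second.

```python
from collections import deque


def subsumes(g1, g2):
    """Decide whether g1 subsumes g2 in the sense of Definition 88."""
    (r1, e1), (r2, e2) = g1, g2
    image = {r1: r2}          # vertex of g1 -> vertex of g2 reached by the same strings
    queue = deque([r1])
    while queue:
        u = queue.popleft()
        for label, v in e1[u].items():
            if label not in e2[image[u]]:
                return False  # some g1-supported string is not g2-supported
            w = e2[image[u]][label]
            if v not in image:
                image[v] = w
                queue.append(v)
            elif image[v] != w:
                return False  # two g1-equivalent strings are not g2-equivalent
    return True


# Hypothetical target: the strings "a" and "b" are supported and equivalent.
target = (0, {0: {"a": 1, "b": 1}, 1: {}})
# Adding an edge yields a positive example; deleting one yields a negative example.
with_extra_edge = (0, {0: {"a": 1, "b": 1}, 1: {"a": 0}})
with_missing_edge = (0, {0: {"a": 1}, 1: {}})
# Separating the endpoints of "a" and "b" breaks the required equivalence.
split_paths = (0, {0: {"a": 1, "b": 2}, 1: {}, 2: {}})
# The universally positive example over the alphabet {a, b}.
universal = (0, {0: {"a": 0, "b": 0}})
```

Each vertex and edge of the first graph is visited once, so the check runs in time linear in the size of the first graph under this encoding.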

Definition 89 ([33]) Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two equivalence graphs. The
cross-product of $G_1$ and $G_2$, denoted $G_1 \times G_2$, is defined as follows. Let $p_1, p_2, \ldots, p_{|V_1|}$ denote
the vertices of $G_1$ with $p_1$ denoting the root, and let $q_1, q_2, \ldots, q_{|V_2|}$ denote the vertices of $G_2$
with $q_1$ denoting the root. If $G_1$ or $G_2$ is empty, then $G_1 \times G_2$ is empty. Otherwise, the vertex
set of $G_1 \times G_2$ (a subset of $V_1 \times V_2$) and the edge set of $G_1 \times G_2$ are defined recursively:

• The graph $G_1 \times G_2$ has a root denoted $(p_1, q_1)$.

• The graph $G_1 \times G_2$ has a vertex denoted $(p_{i_2}, q_{j_2})$ and edge $(p_{i_1}, q_{j_1}) \xrightarrow{\sigma} (p_{i_2}, q_{j_2})$ iff $G_1$
has edge $p_{i_1} \xrightarrow{\sigma} p_{i_2}$ and $G_2$ has edge $q_{j_1} \xrightarrow{\sigma} q_{j_2}$ and $G_1 \times G_2$ has the vertex denoted $(p_{i_1}, q_{j_1})$.

Note that $G_1 \times G_2$ is an equivalence graph whenever both $G_1$ and $G_2$ are equivalence graphs.
The following properties of $G_1 \times G_2$ are either easily verified, or follow from [36].

Property 90

1. A string $w$ is $(G_1 \times G_2)$-supported iff $w$ is both $G_1$-supported and $G_2$-supported.

2. For any strings $s$ and $t$, $s \equiv_{G_1 \times G_2} t$ iff both $s \equiv_{G_1} t$ and $s \equiv_{G_2} t$.

3. $G_1 \times G_2$ is the most specific generalization (least upper bound) of $G_1$ and $G_2$ with respect
to the subsumption ordering. That is, if $G_1$, $G_2$, and $G$ are equivalence graphs, then if $G$
subsumes both $G_1$ and $G_2$, then $G$ subsumes $G_1 \times G_2$.
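
The recursive definition amounts to exploring pairs of vertices reachable from the pair of roots. The following sketch, again under an assumed encoding (a graph is a pair `(root, edges)` with `edges[v]` mapping labels to successors; empty graphs are not represented; the helper names are hypothetical), builds the cross-product and provides `run`, which traces a string through a graph, so that items 1 and 2 of Property 90 can be spot-checked on short strings.

```python
from itertools import product


def cross(g1, g2):
    """Cross-product of two equivalence graphs (reachable vertex pairs only)."""
    (r1, e1), (r2, e2) = g1, g2
    root = (r1, r2)
    edges, stack = {root: {}}, [root]
    while stack:
        p, q = stack.pop()
        for label, p2 in e1[p].items():
            if label in e2[q]:                 # both graphs must have the edge
                succ = (p2, e2[q][label])
                edges[(p, q)][label] = succ
                if succ not in edges:
                    edges[succ] = {}
                    stack.append(succ)
    return root, edges


def run(g, w):
    """Vertex reached by string w, or None if w is not supported by g."""
    v, edges = g
    for symbol in w:
        if symbol not in edges[v]:
            return None
        v = edges[v][symbol]
    return v


def strings_up_to(alphabet, n):
    """All strings over the alphabet of length at most n."""
    return ["".join(t) for k in range(n + 1) for t in product(alphabet, repeat=k)]


# Two hypothetical equivalence graphs over the alphabet {a, b}.
g1 = (0, {0: {"a": 1, "b": 1}, 1: {"a": 1}})
g2 = (0, {0: {"a": 1, "b": 2}, 1: {"a": 1, "b": 0}, 2: {}})
g12 = cross(g1, g2)
```

Because each product vertex is the pair of vertices that a string reaches in the two factors, support and equivalence in the product are determined componentwise, which is the content of items 1 and 2.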

Definition 91 An equivalence graph $G$ is said to be pruned with respect to an equivalence
graph $G_*$ if $G_*$ subsumes $G$, but does not subsume any proper subgraph of $G$.


The following is a useful property of pruned graphs. 

Property 92 Let $G$ and $G_*$ be two equivalence graphs such that $G$ is pruned with respect to
$G_*$. Then for every vertex $v$ in $G$ and every outgoing edge label $\sigma$ from $v$, $[\![v]\!]_G$ contains some
($G_*$-supported) string $s$ such that $s\sigma$ is $G_*$-supported.

Proof Sketch: Suppose to the contrary that $G$ contains a vertex $v$ with outgoing edge label
$\sigma$ such that $[\![v]\!]_G$ contains no string $s$ such that $s\sigma$ is $G_*$-supported. But then deleting $v$'s
outgoing edge labeled $\sigma$ produces a proper subgraph of $G$ that supports every $G_*$-supported
string and leaves equivalent every pair of $G_*$-equivalent strings. This contradicts the hypothesis
that $G$ is pruned. □

We need the following technical lemma. 

Lemma 93 Let $G_1$, $G_2$, and $G_*$ be equivalence graphs such that both $G_1$ and $G_2$ are pruned
with respect to $G_*$. If there exists a vertex $v$ of $G_1 \times G_2$ such that $[\![v]\!]_{G_1 \times G_2}$ contains only
$G_*$-unsupported strings, then there are two $G_*$-supported strings that are $G_1$-equivalent but not
$(G_1 \times G_2)$-equivalent.

Proof: First observe that if $G_*$ supports no strings at all - not even the empty string - then
$G_*$, $G_1$, and $G_2$ must each be the empty graph. In this case the lemma holds trivially. The rest
of the proof assumes that at least the empty string is supported by $G_*$ and therefore also by $G_1$
and $G_2$ by Property 90 item 1. The proof assumes the existence of a vertex $v$ of $G_1 \times G_2$ such
that $[\![v]\!]_{G_1 \times G_2}$ contains only $G_*$-unsupported strings and then constructs two $G_*$-supported
strings $s$ and $s'$ that are $G_1$-equivalent but $(G_1 \times G_2)$-inequivalent.

Let $v$ be a vertex of $G_1 \times G_2$ such that $[\![v]\!]_{G_1 \times G_2}$ contains only $G_*$-unsupported strings,
and let $w$ be any string in $[\![v]\!]_{G_1 \times G_2}$. Now, since the $G_1 \times G_2$ equivalence class containing
the empty string contains a $G_*$-supported string (the empty string itself) and since $[\![w]\!]_{G_1 \times G_2}$
contains no $G_*$-supported strings, there exists a prefix $w_p$ of $w$ and an edge label $\sigma$ such that

• $w_p\sigma$ is a prefix of $w$,

• $[\![w_p]\!]_{G_1 \times G_2}$ contains a $G_*$-supported string, and

• $[\![w_p\sigma]\!]_{G_1 \times G_2}$ contains no $G_*$-supported string.

Let $s$ be any $G_*$-supported string in $[\![w_p]\!]_{G_1 \times G_2}$. Now observe that since $w$ is
$(G_1 \times G_2)$-supported so are both $w_p\sigma$ and $w_p$. Also observe that since $w_p\sigma$ is $(G_1 \times G_2)$-supported, $w_p\sigma$
must be $G_1$-supported by Property 90 item 1. But then, by Property 92, since $G_1$ is pruned with
respect to $G_*$, $[\![w_p]\!]_{G_1}$ must contain a $G_*$-supported string $s'$ such that $s'\sigma$ is also $G_*$-supported.
Thus we have two $G_*$-supported strings $s$ and $s'$ such that $s \equiv_{G_1 \times G_2} w_p$ (which by Property 90
item 2 implies that $s \equiv_{G_1} w_p$) and $s' \equiv_{G_1} w_p$, and so $s \equiv_{G_1} s'$. Now since $s'\sigma$ is $G_*$-supported,
if $s \equiv_{G_1 \times G_2} s'$ then $s' \equiv_{G_1 \times G_2} w_p$ so that $[\![w_p\sigma]\!]_{G_1 \times G_2}$ contains the $G_*$-supported string $s'\sigma$,
contradicting the choice of $w_p$ and $\sigma$. Thus $s \not\equiv_{G_1 \times G_2} s'$. Therefore we have constructed two
$G_*$-supported strings that are $G_1$-equivalent but are $(G_1 \times G_2)$-inequivalent. □

The proof that the learning algorithm is correct and efficient (Theorem 96) will follow 
easily from a technical lemma (Lemma 95), which asserts that progress is made with each new 
hypothesis of the algorithm. The proof of the lemma follows the proof of Theorem 96. We first 
need the following definition: 

Definition 94 Let $G$ be an equivalence graph that $G_*$ subsumes. Then $\equiv^*_G$ is the equivalence
relation $\equiv_G$ restricted to $G_*$-supported strings.

Lemma 95 Let $G_1$ and $G_2$ both be pruned with respect to $G_*$, and let $G = \mathrm{Prune}(G_1 \times G_2)$.
Further, suppose that $G_1$ does not subsume $G_2$. Then $\equiv^*_G$ is a proper refinement of $\equiv^*_{G_1}$.

Theorem 96 Let $G_*$ be the target concept. The algorithm Learn finds an equivalence graph
equivalent to $G_*$, and at no point during execution does the running time exceed a polynomial
in $|\Sigma|$, $|G_*|$, and the size of the largest counterexample seen so far.

Proof Sketch: The initial hypothesis has one equivalence class. A simple inductive proof
shows that every hypothesis is a positive example. By Lemma 95, each counterexample causes
the number of equivalence classes over supported strings to increase by at least one. Since the
number of equivalence classes in the hypothesis is bounded above by the number of equivalence
classes in the target,⁵ the number of equivalence queries is bounded above by that number.

Finally, observe that at each step, if $G$ is the counterexample with the greatest number of
vertices seen so far, the algorithm has made at most $|G_*|^2 \cdot |G| \cdot |\Sigma|$ membership queries, and has
run for at most a number of steps that is polynomial in $|G_*|$, $|G|$, and $|\Sigma|$. This follows from the
fact that at each step, if $G$ is the counterexample having the greatest number of vertices seen
so far, the number of membership queries used by Prune on $H \times G$ is at most $O(|\Sigma| \cdot n_H \cdot n_G)$,
where $n_H$ and $n_G$ are the number of vertices in $H$ and $G$, respectively. Since $n_H$ is bounded

5 If some vertex of the hypothesis contained no $G_*$-supported string, that vertex would have been pruned. On
the other hand, if the number of vertices in the hypothesis containing $G_*$-supported strings is greater than the
number of vertices in $G_*$, then the hypothesis is a negative example because some pair of $G_*$-equivalent strings
are inequivalent in the hypothesis.

above by $n_{G_*}$ and since $|H| \le |G_*|$,⁶ the number of membership queries used by a single call
to Prune is $O(|\Sigma| \cdot |G_*| \cdot |G|)$, where $G$ is the largest counterexample yet witnessed by the
algorithm.

□

Proof of Lemma 95: It is sufficient to show that $\equiv^{G_1 \times G_2}_{G^*}$ is a proper refinement of $\equiv^{G_1}_{G^*}$, since $G$ is obtained by pruning from $G_1 \times G_2$, and no edges or vertices are added, only deleted; hence $\equiv_{G}$ is a refinement of $\equiv_{G_1 \times G_2}$.

Now, $\equiv_{G_1 \times G_2}$ is a refinement of $\equiv_{G_1}$. Further, since both are subsumed by $G^*$, they both support every $G^*$-supported string. Hence, $\equiv^{G_1 \times G_2}_{G^*}$ is a refinement of $\equiv^{G_1}_{G^*}$.

If the number of equivalence classes of $\equiv^{G_1 \times G_2}_{G^*}$ exceeds the number of equivalence classes of $\equiv^{G_1}_{G^*}$, then the lemma is proved. Otherwise, since $\equiv^{G_1 \times G_2}_{G^*}$ is a refinement of $\equiv^{G_1}_{G^*}$, the number of equivalence classes must be the same, and the classes must be identical. We show that this leads to a contradiction, thus proving the lemma.

By Property 90 item 2, for any strings $x$ and $y$, $x \equiv_{G_1 \times G_2} y$ iff $x \equiv_{G_1} y$ and $x \equiv_{G_2} y$, and since all three support all $G^*$-supported strings, we have that for any $G^*$-supported strings $x$ and $y$, $x \equiv^{G_1 \times G_2}_{G^*} y$ iff $x \equiv^{G_1}_{G^*} y$ and $x \equiv^{G_2}_{G^*} y$. This, together with our assumption that the relations $\equiv^{G_1 \times G_2}_{G^*}$ and $\equiv^{G_1}_{G^*}$ are identical, implies that the relations $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$ are identical.

By the hypothesis of this lemma, $G_1$ does not subsume $G_2$, hence there exist strings $t_1$ and $t_2$ such that $t_1 \equiv_{G_1} t_2$ but $t_1 \not\equiv_{G_2} t_2$. (Otherwise, $G_1$ supports some $t$ that $G_2$ does not; but then $t$ is $G_1$-equivalent to some $G^*$-supported $w$, since $G_1$ is pruned. But $t$ is not $G_2$-equivalent to $w$, since $t$ is not supported in $G_2$. In this case, let $t_1 = t$ and $t_2 = w$.)

Clearly, both $t_1$ and $t_2$ are supported in $G_1$.

Case 1: both $t_1$ and $t_2$ are supported in $G_2$. Since both $t_1$ and $t_2$ are supported in both $G_1$ and $G_2$, they are both supported in $G_1 \times G_2$. Since they are not equivalent in $G_2$, they are not equivalent in $G_1 \times G_2$. Let $v$ be the vertex that $t_1$ and $t_2$ both go to in $G_1$, and let $v_1$ and $v_2$ ($v_1 \neq v_2$) be the vertices that $t_1$ and $t_2$ go to, respectively, in $G_1 \times G_2$. Since $G_1$ is pruned with respect to $G^*$, there exists a $G^*$-supported string $w$ that goes to $v$ in $G_1$. If there do not exist $G^*$-supported strings $w_1$ and $w_2$ such that $w_1$ and $w_2$ go to $v_1$ and $v_2$ in $G_1 \times G_2$, respectively, then Lemma 93 applies, and we conclude that $G_1$ and $G_1 \times G_2$ are not equivalent on $G^*$-supported strings, a contradiction. Since $t_1$ and $w_1$ are equivalent in $G_1 \times G_2$, and since $t_2$ and $w_2$ are equivalent in $G_1 \times G_2$, we must have the same equivalences in $G_1$ (noting that $G_1$ supports all four strings). But $t_1$ and $t_2$ are equivalent in $G_1$, hence $w_1$ and $w_2$ are equivalent in $G_1$, transitively. But $w_1$ and $w_2$ are not equivalent in $G_1 \times G_2$, contradicting our assumption that $G_1 \times G_2$ and $G_1$ define the same relation on $G^*$-supported strings.

6 For every edge of $H$ that could not be deleted by Prune, there is some equivalence class, i.e., vertex, of $G^*$ to which that edge can be associated. Thus the number of edges of $H$ is at most the number of edges in $G^*$, so $|H| \le |G^*|$.

Case 2: at least one of $t_1$ and $t_2$ is not supported in $G_2$. Without loss of generality, assume $t_1$ is not supported by $G_2$. Let $t\sigma$ be the shortest prefix of $t_1$ such that $t$ is $G_2$-supported, but $t\sigma$ is not $G_2$-supported. 7

Both $G_1$ and $G_2$ support $t$, so consider the two paths that $t$ induces in these two graphs; we claim that the $\equiv^{G_1}_{G^*}$ equivalence class containing $t$ is the same as the $\equiv^{G_2}_{G^*}$ equivalence class containing $t$. Suppose by way of contradiction that this is not the case. Now $t = \sigma_1\sigma_2\cdots\sigma_{|t|}$, so look at the first $i$ such that $\sigma_1\sigma_2\cdots\sigma_i$ is contained in non-identical equivalence classes of $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$. Since $G_1$ and $G_2$ are pruned, the equivalence class containing $\sigma_1\sigma_2\cdots\sigma_{i-1}$ must contain some $G^*$-supported string $w$ such that $w\sigma_i$ is also $G^*$-supported. 8 Now since the equivalence classes of $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$ are identical, the $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$ equivalence classes containing $w\sigma_i$ must be identical. But $w\sigma_i$ is both $G_1$- and $G_2$-equivalent to $\sigma_1\sigma_2\cdots\sigma_i$, contradicting the assumption that $i$ is the first index at which the path $\sigma_1\sigma_2\cdots\sigma_{|t|}$ reaches non-identical equivalence classes in $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$. Hence $t$ is in identical $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$ equivalence classes. Now, since $[t]_{G_2}$ contained no outgoing edge labeled $\sigma$, no $G^*$-supported string $w$ in $[t]_{G_2}$ is such that $w\sigma$ is $G^*$-supported. But then this means that the outgoing edge from $[t]_{G_1}$ labeled $\sigma$ can be deleted from $G_1$, contradicting our assumption that $G_1$ was pruned with respect to $G^*$. □

7.4 Labeled Equivalence Graphs 

Here we consider extending the class of equivalence graphs to allow for labeled vertices. The set of vertex labels is required to possess enough structure to allow computing what will be the unique most specific generalization (least upper bound) between any pair of vertex labels. Specifically, the structure we require is a finite lattice. The following definition supplies the notion we adopt.

7 Such a prefix exists because the empty string is supported.

8 If $i$ is 1, $w$ is the empty string. Clearly the empty string must be contained in the equivalence class represented by the root of any equivalence graph, and since the relations $\equiv^{G_1}_{G^*}$ and $\equiv^{G_2}_{G^*}$ are identical, the empty string must be in identical equivalence classes in $G_1$ and $G_2$.

Definition 97 Let $\Sigma$ be an alphabet, and let $\mathcal{L} = (\Gamma, \bot, \preceq, \sqcup)$ be a finite lattice having partial order $\preceq$ over elements of the set $\Gamma$, having (unique) minimum element $\bot$, and having the binary join operator $\sqcup$ that returns the unique least upper bound of its two operands. Then an $\mathcal{L}$-labeled equivalence graph over $\Sigma$ is a graph $G$ that is an equivalence graph over $\Sigma$ in which each vertex $v$ of $G$ has been labeled with some $\gamma \in \Gamma$. 9

For any labeled equivalence graph $G$ and any $G$-supported string $w$, let $\ell_G(w)$ denote the label of the vertex reached by $w$. We now modify the earlier subsumption-induced partial order for equivalence graphs to obtain a partial order on labeled equivalence graphs based on which strings are supported, which strings are equivalent, and which vertex labels are assigned to the supported strings.

Definition 98 Let $G_1$ and $G_2$ be two $\mathcal{L}$-labeled equivalence graphs. Then $G_1$ subsumes $G_2$ if every $G_1$-supported string is $G_2$-supported, if $\ell_{G_2}(w) \preceq \ell_{G_1}(w)$ for every $G_1$-supported string $w$, and if every pair of $G_1$-equivalent strings are $G_2$-equivalent.

It is shown in [33] that a Classic description subsumes a second Classic description iff 
the labeled equivalence graph of the first subsumes that of the second. Thus, positive examples 
of a labeled equivalence graph $G$ will be graphs $G'$ which are subsumed by $G$.

We now modify the earlier cross-product definition for equivalence graphs to obtain the 
cross-product definition for labeled equivalence graphs. 

Definition 99 Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two $\mathcal{L}$-labeled equivalence graphs. The cross-product of $G_1$ and $G_2$, denoted $G_1 \times G_2$, is defined as follows. Let $p_1, p_2, \ldots, p_{|V_1|}$ denote the vertices of $G_1$ with $p_1$ denoting the root, and let $q_1, q_2, \ldots, q_{|V_2|}$ denote the vertices of $G_2$ with $q_1$ denoting the root. If $G_1$ or $G_2$ is empty, then $G_1 \times G_2$ is empty. Otherwise, the vertex set of $G_1 \times G_2$ (a subset of $V_1 \times V_2$) and the edge set of $G_1 \times G_2$ are defined recursively:

♦ The graph $G_1 \times G_2$ has a root vertex denoted $(p_1, q_1)$ labeled $\ell_{G_1}(p_1) \sqcup \ell_{G_2}(q_1)$.

♦ The graph $G_1 \times G_2$ also has a vertex denoted $(p_{i_2}, q_{j_2})$ labeled $\ell_{G_1}(p_{i_2}) \sqcup \ell_{G_2}(q_{j_2})$ together with the edge $(p_{i_1}, q_{j_1}) \xrightarrow{\sigma} (p_{i_2}, q_{j_2})$ iff $G_1$ has edge $p_{i_1} \xrightarrow{\sigma} p_{i_2}$ and $G_2$ has edge $q_{j_1} \xrightarrow{\sigma} q_{j_2}$ and $G_1 \times G_2$ has the vertex denoted $(p_{i_1}, q_{j_1})$.

Note that $G_1 \times G_2$ is an $\mathcal{L}$-labeled equivalence graph whenever both $G_1$ and $G_2$ are equivalence graphs.

9 If $\mathcal{L}_1, \ldots, \mathcal{L}_m$ is a collection of finite lattices, then the tuples $\{(\gamma_1, \ldots, \gamma_m) : \gamma_i \in \Gamma_i\}$ form a finite lattice by taking $(\bot_1, \ldots, \bot_m)$ as the minimum element and taking $(\gamma_1, \ldots, \gamma_m) \preceq (\gamma_1', \ldots, \gamma_m')$ exactly when $\gamma_i \preceq_i \gamma_i'$ for each $i$. We will use this observation later when we return to our discussion of Classic.
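The recursive construction in Definition 99 can be sketched as a breadth-first traversal from the root pair. The sketch below is illustrative only: the dictionary encoding of a graph (`root`, `labels`, `edges`), the function name `cross_product`, and the choice of set intersection as the join in the example are assumptions made here, not part of the original algorithm description. It also assumes each vertex has at most one outgoing edge per symbol.

```python
from collections import deque

def cross_product(g1, g2, join):
    """Build the cross-product of two labeled graphs.

    Hypothetical encoding: a graph is a dict with 'root', 'labels'
    (vertex -> label) and 'edges' (vertex -> {symbol: vertex});
    None stands for the empty graph.
    """
    if g1 is None or g2 is None:          # an empty operand gives an empty product
        return None
    root = (g1["root"], g2["root"])
    labels = {root: join(g1["labels"][root[0]], g2["labels"][root[1]])}
    edges = {root: {}}
    queue = deque([root])
    while queue:
        p, q = queue.popleft()
        out1 = g1["edges"].get(p, {})
        out2 = g2["edges"].get(q, {})
        # an edge is kept only if both graphs have it under the same symbol
        for symbol in out1.keys() & out2.keys():
            target = (out1[symbol], out2[symbol])
            if target not in labels:
                labels[target] = join(g1["labels"][target[0]],
                                      g2["labels"][target[1]])
                edges[target] = {}
                queue.append(target)
            edges[(p, q)][symbol] = target
    return {"root": root, "labels": labels, "edges": edges}

# Example: in g1 the strings "x" and "y" are equivalent, in g2 they are not.
g1 = {"root": "a",
      "labels": {"a": frozenset({"red", "shaggy"}), "b": frozenset({"obese"})},
      "edges": {"a": {"x": "b", "y": "b"}}}
g2 = {"root": "u",
      "labels": {"u": frozenset({"shaggy"}),
                 "v": frozenset({"obese", "graceful"}),
                 "w": frozenset()},
      "edges": {"u": {"x": "v", "y": "w"}}}
product = cross_product(g1, g2, join=lambda s, t: s & t)
```

In the example the product keeps "x" and "y" in separate vertices, which illustrates why the product's equivalence relation refines that of each operand.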

Theorem 100 Let $\bot$ be the minimum element of a finite lattice $\mathcal{L}$ having maximum chain length $d$ and having polynomial time computable join operator $\sqcup$, and let $\Sigma$ be a set of edge labels. Then the algorithm composed of figures 7.3 and 7.4 learns $\mathcal{L}$-labeled equivalence graphs over $\Sigma$ from membership and equivalence queries in time polynomial in $|\Sigma|$, $d$, the longest counterexample received, and the size of the target concept. 10

Proof: The algorithm is modified only in that the sole vertex of the initial hypothesis is labeled by $\bot$ and the cross-product operator for labeled equivalence graphs is used; specifically, Prune does not change: only edge deletions are attempted.

Observe that a counterexample $G$ to $H$ must be of one of the following (not necessarily mutually exclusive) types:

♦ Some string $w$ supported by both $H$ and $G$ has $\ell_H(w) \prec \ell_G(w)$,

♦ Some $H$-supported string is not $G$-supported, or

♦ Some pair of $H$-equivalent strings are not $G$-equivalent.

If the counterexample $G$ was of either of the latter two types, then $G$ is a counterexample to $H$ based solely on their underlying (non-vertex-labeled) equivalence graphs. Therefore, Lemma 95 applies to show that $\equiv^{\mathrm{Prune}(H \times G)}_{G^*}$ is a proper refinement of $\equiv^{H}_{G^*}$. This can happen at most as many times as there are vertices in $G^*$.

If the counterexample is only of the first type, then the underlying (non-vertex-labeled) equivalence graphs of $H$ and $G$ (and therefore, Prune($H \times G$)) are isomorphic. As such, some vertex label of $H$ was generalized. Thus, this first type of counterexample can happen at most $d\,n_H \le d \cdot |H| \le d \cdot |G^*|$ times between occurrences of a counterexample of the second or third type, where $n_H$ is the number of vertices in $H$.

10 At no point in the run of the algorithm will it have consumed time more than a polynomial in $d$, the size of the target concept, and the longest counterexample received to that point.

A naive analysis assumes that in the worst case the label on every vertex must be updated $d$ times between changes to the equivalence classes of the hypothesis. This produces a bound of $O(d\,n_{G^*}^2)$ on the number of counterexamples received, where $n_{G^*}$ is the number of vertices in $G^*$.

A more careful analysis recognizes that, until a collection of target equivalence classes were split, generalizing the vertex label for one of the equivalence classes generalized the label for every target equivalence class in the collection; thus every target equivalence class need only be isolated once and have its vertex label changed at most $d$ times overall. This bounds the number of counterexamples by $O(d\,n_{G^*})$. Thus the algorithm witnesses $O(d \cdot |G^*|)$ counterexamples to its equivalence queries.

As in the case of (non-vertex-labeled) equivalence graphs, the number of membership queries used by Prune on $H \times G$ is at most $O(|\Sigma| \cdot n_H \cdot n_G)$, where $n_H$ and $n_G$ are the number of vertices in $H$ and $G$, respectively, so that the number of membership queries used by a single call to Prune is $O(|\Sigma| \cdot |G^*| \cdot |G|)$, where $G$ is the largest counterexample yet witnessed by the algorithm.

Finally, since $\sqcup$ can be applied in polynomial time, and since $|H| \le |G^*|$ for each hypothesis $H$, the class of labeled equivalence graphs is polynomial time learnable using membership and equivalence queries. □

7.5 Application to Classic 

The graphical representation of Classic used in [36] annotates the vertices of the graph with the AT-LEAST, AT-MOST, FILLS, ONE-OF, and PRIM constraints; thus these constraints make a natural choice for the vertex labels discussed in the previous section. Combining these different kinds of constraints into tuples that serve as the actual vertex label lattice $\mathcal{L}$ is also a natural choice, because ordering tuples $(s_1, \ldots, s_k) \preceq (t_1, \ldots, t_k)$ exactly when every $s_i \preceq t_i$ is in accordance with the notion of subsumption for Classic [36].

We would like to exploit this similarity to labeled equivalence graphs by employing a prediction preserving reduction using the labeled equivalence graph learning algorithm developed in the previous section. The reduction would use the polynomial time transformations between Classic descriptions and their graphical representations developed by [26, 36] to turn the Classic description examples into a form suitable for Learn and to turn the graphical representations queried and hypothesized by Learn into Classic descriptions suitable for examination outside of Learn. Unfortunately, the semantics imposed on the graphical representation of Classic dissuade us from this black box approach; we will employ the transformations of [26, 36, 33] only after modifying Learn.

Because no legal graphical representation of a Classic description has non-attribute roles 
participating in a SAME-AS constraint, we are prevented from constructing the suggested 
universal positive example for Learn. Thus we are forced to modify Learn so that it constructs 
only labeled equivalence graphs whose edge labels adhere to this restriction on equivalent strings 
over the role alphabet. 

In lieu of a universally positive example that satisfies every constraint of the target, we rely 
on the semantics of the AT-LEAST and AT-MOST constructors to build a Classic description 
that always denotes the empty set, which is guaranteed to be a subset (and therefore a positive 
example) of any target description. Concretely, let r be any role. Then 

(AND
    (AT-LEAST 1 r)
    (AT-MOST 0 r))

always denotes the empty set; such a concept is said to be inconsistent. Making an equivalence query on the graph of this concept will provide the learner with a positive counterexample, $G_0$, that satisfies all the constraints of the target. The graph $G_0$ serves the purpose of the initial universal positive example $H$ used as the initial hypothesis by Learn. Having solved the problem of the distinction between attributes and arbitrary roles in Classic, we now address another impediment to a direct application of Theorem 100.

Most of the graphical annotations for Classic possess sufficient structure to apply Theorem 100 immediately. As an example of computing upper bounds for a set of PRIM vertex labels, suppose that we have one label {red, shaggy, obese} and a second label {brown, shaggy, graceful, obese}; then the set intersection, {shaggy, obese}, is the most specific label that generalizes both of these labels. Thus, proper generalizations are made by removing some primitive from the set. Consequently, proper generalization can occur at most $|\mathcal{P}|$ times, where $\mathcal{P}$ is the set of primitives. Unfortunately, the AT-LEAST and AT-MOST constraints as used in Classic do not possess the structure required because they do not form finite lattices.
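The PRIM example above can be checked with a few lines of code. This is a minimal sketch, assuming PRIM labels are encoded as Python sets of primitive names; the function name `prim_join` and the small primitive set are hypothetical and serve only to illustrate that the join is set intersection and that a chain of proper generalizations has length at most the number of primitives.

```python
def prim_join(a, b):
    """Most specific PRIM label generalizing both a and b (set intersection)."""
    return frozenset(a) & frozenset(b)

first = {"red", "shaggy", "obese"}
second = {"brown", "shaggy", "graceful", "obese"}
joined = prim_join(first, second)

# Walk one longest chain of proper generalizations: start from the most
# restrictive label (every primitive) and drop one primitive per step.
primitives = ["red", "brown", "shaggy", "graceful", "obese"]  # illustrative set
label = frozenset(primitives)
steps = 0
for name in primitives:
    generalized = label - {name}
    assert prim_join(label, generalized) == generalized  # generalized is above label
    label = generalized
    steps += 1
```

After the loop, `steps` equals the number of primitives and `label` is the empty (null) constraint.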

The set of possible AT-LEAST constraints has the ordering (AT-LEAST $k + j$ $r$) $\preceq$ (AT-LEAST $k$ $r$) for non-negative integers $k$ and $j$ and role $r$; thus the AT-LEAST constraints preclude the existence of the minimum element we need to express our initial hypothesis. Note, however, that because the counterexample $G_0$ obtained above is a positive example that does not always denote the empty set, the AT-LEAST conditions appearing in $G_0$ finitely bound the AT-LEAST lattice. That is, any AT-LEAST constraint on the vertex of $G_0$ reached by any target supported string $w$ is more restrictive than the corresponding AT-LEAST constraint reached by $w$ in the target. Therefore, the lattice of plausible AT-LEAST constraints for $w$ is finite, having as the minimal element the value reached by $w$ in $G_0$.

The set of AT-MOST constraints suffers from a slightly different problem; the most general 
AT-MOST constraint is the absence of an AT-MOST constraint, which is not an AT-MOST 
constraint at all. This situation allows an adversary to prevent us from ever determining 
certain targets, even after seeing infinitely many equivalence query counterexamples. To see 
this consider the target consisting of a root with no AT-MOST constraint for role r. Now 
consider a sequence of (positive) counterexamples that are simply a root with an explicit AT- 
MOST condition, where example i asserts (AT-MOST i r). After example i, Learn would 
hypothesize a single root vertex asserting (AT-MOST i r). Clearly, no positive example ever 
allows the AT-MOST constraint to be removed from the root. 11 

To overcome this difficulty, we use a standard technique of decomposing the concept class to be learned into a collection of subclasses indexed by some parameter m such that the union of these subclasses is the original concept class and such that there is an algorithm that given m learns the subclass indexed by m and that runs in time polynomial in m and the other relevant parameters of the class. This permits us to "guess" efficiently the smallest subclass index in which the chosen target lies: try a value of m and if the algorithm exceeds its polynomial time bound, restart the algorithm using m + 1.

11 This problem does not arise in the AT-LEAST constraint because (AT-LEAST 0) is semantically equivalent to imposing no AT-LEAST constraint.

The decomposition we use is the value of the largest integer appearing in any AT-LEAST
or AT-MOST constraint in any vertex of the graph. Any graph in the subclass indexed by m 
must have no explicit AT-MOST constraints more general than (AT-MOST m), thus whenever 
we would infer a hypothesis with an AT-MOST constraint more general than this we simply 
erase the AT-MOST constraint altogether. 

The remainder of the Classic constructs used to annotate vertices of the graph are straightforward. We assume predefined (finite) sets $\mathcal{P}$ of primitives, $\mathcal{I}$ of disjoint primitives, and $\mathcal{R}$ of roles. The most restrictive PRIM annotation demands that individuals satisfy every primitive. In the worst case this annotation would be properly generalized $|\mathcal{P}|$ times before obtaining the null constraint not demanding that individuals satisfy any primitive. The FILLS constraint behaves like the PRIM constraint, except over the disjoint primitives $\mathcal{I}$.

The ONE-OF constraint is most restrictive when no disjoint primitive is specified; as positive examples are seen that name disjoint primitives in a ONE-OF constraint, the named disjoint primitives are unioned to the current collection. The ONE-OF constraint of penultimate generality is the constraint that permits individuals to satisfy any of the disjoint primitives; the most general ONE-OF constraint is simply eliminating the ONE-OF constraint altogether. Thus there are at most $O(|\mathcal{I}|)$ proper generalizations made for a ONE-OF constraint.

We are now ready to apply Theorem 100 to Classic. 

Theorem 101 Classic is learnable from membership and equivalence queries in time polynomial in $|\mathcal{R}|$, $|\mathcal{I}|$, $|\mathcal{P}|$, $s$, $t$, and $m$, where $\mathcal{R}$ is the set of roles, $\mathcal{I}$ is the set of disjoint primitives, $\mathcal{P}$ is the set of primitives, $s$ is the number of symbols needed to write the target Classic description, $t$ is the number of symbols needed to write the largest counterexample description witnessed, and $m$ is the largest integer appearing in any AT-LEAST or AT-MOST constraint in the target description.

Proof: For the vertex labels $\Gamma$ we chose tuples over the AT-LEAST, AT-MOST, PRIM, FILLS, and ONE-OF constraints; the AND, ALL, and SAME-AS constraints are reflected in the structure of the graph and were handled in Section 7.3. There are $|\mathcal{R}|$ AT-LEAST, $|\mathcal{R}|$ AT-MOST, and $|\mathcal{R}|$ FILLS components in the tuple, one of each for each role in $\mathcal{R}$; there is also one PRIM and one ONE-OF component in the tuple. Now, ordering the tuples in accordance with the Classic subsumption relation, we find that the longest chain in the subclass indexed by $m$ has length $O(|\mathcal{R}| \cdot m + |\mathcal{R}| \cdot |\mathcal{I}| + |\mathcal{P}| + |\mathcal{I}|)$. Theorem 100 applies to each subclass. We can assume that the largest integer appearing in an AT-LEAST or AT-MOST constraint of the target is $i$ until the time bound under this assumption is exceeded, at which point we restart assuming that $2i$ will upper bound the largest integer appearing in an AT-LEAST or AT-MOST constraint of the target. After at most $\lg m$ restarts, our assumed upper bound will be sufficient and will be at most twice $m$. This "guessing" produces at most a factor of $O(\lg m)$ slowdown over having known $m$ from the outset. □
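The restart-with-doubling strategy used in this proof can be sketched as follows. The sketch is an illustration under stated assumptions: `learn_subclass` is a hypothetical callback that runs the subclass learner under an assumed bound and returns `None` when its time budget for that bound is exceeded; neither the name nor the return convention comes from the original text.

```python
def learn_with_doubling(learn_subclass, initial_guess=1):
    """Guess the subclass index by doubling until the learner succeeds.

    learn_subclass(bound) is assumed to return a hypothesis, or None if the
    time budget associated with `bound` was exceeded.
    """
    bound = initial_guess
    restarts = 0
    while True:
        result = learn_subclass(bound)
        if result is not None:
            return result, bound, restarts
        bound *= 2          # assumed bound too small: double it and restart
        restarts += 1

# Toy stand-in for the learner: it succeeds exactly when the assumed bound
# is at least the (hidden) largest integer m = 37 in the target.
hidden_m = 37
toy_learner = lambda bound: "hypothesis" if bound >= hidden_m else None
result, final_bound, restarts = learn_with_doubling(toy_learner)
```

With the toy learner the assumed bounds tried are 1, 2, 4, ..., 64, so the final bound is less than twice the hidden value and the number of restarts is logarithmic in it.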

This theorem is somewhat unappealing in that it actually demonstrates only a pseudo-polynomial time algorithm for Classic. For any reasonable encoding of the integers the value of m is exponentially larger than the number of digits needed to express it; thus allowing time polynomial in m is allowing us, in a very real sense, time exponential in the length of the encoding of m. However, observe that the problematic constraints, AT-LEAST and AT-MOST, each form a total ordering. Thus, the pseudo-polynomial running time of the algorithm can be reduced via binary search to polynomial running time. Specifically, Prune will, using membership queries, perform independent binary searches for each AT-LEAST constraint of each vertex to minimize the value of each AT-LEAST constraint.

Similarly, Prune will determine whether any AT-MOST constraint can be removed from 
any vertex, and for any constraint that must remain, Prune will maximize the integer value 
used to express the constraint. To determine whether a constraint can be removed, Prune 
tentatively removes the constraint and makes a membership query on the resulting graph, 
which will remain a positive example if and only if that AT-MOST constraint can be removed. 
If the result is a negative example, then Prune successively doubles the current AT-MOST 
constraint value and makes a membership query until a negative example is obtained. Then 
a standard binary search finds the largest integer value for this AT-MOST constraint that 
produces a positive example. The updated Prune is shown in figure 7.5. 

The m used in this new version of Prune is meant to suggest the index of parameterization used in Theorem 101. This modification reduces the time dependency from a polynomial in m to a polynomial in log m, so that we have the following stronger theorem asserting the fully polynomial time learnability of Classic.
Prune(G)

/* G is a positive example. Each question about whether some alteration to G results
   in a positive example is answered by asking a membership query. */

For each edge e in G
    If removing e^a from G produces a positive example, then remove e from G.
For each vertex v
    For each AT-LEAST constraint c of v
        Let m be the current value for c.
        Perform a binary search in the range [0, m] to find the minimum value for c
        for which G remains a positive example, and replace the current value for c
        with that value.
    For each AT-MOST constraint c of v
        If removing c from G results in a positive example, then remove c from G.
        Else
            Let m be the larger of 1 and the current value for c.
            While replacing m with 2m in c produces a positive example, replace m
            with 2m in c.
            Perform a binary search in the range [0, 2m] to find the maximum value
            for c for which G remains a positive example, and replace the current
            value for c with that value.
Return G

a Along with e, remove any unreachable component.

Figure 7.5: Updated Prune.
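The two numeric searches performed by the updated Prune can be sketched as below. This is a hedged illustration, not the dissertation's code: `is_positive` is a hypothetical stand-in for the membership query "is G still a positive example when the constraint takes this value?", and it is assumed to be monotone (for AT-LEAST, positive for all values at or above some threshold; for AT-MOST, positive for all values at or below some threshold). The AT-MOST routine assumes the caller has already checked that the constraint cannot simply be removed, and it searches from the current value up to 2m - 1, which is where the maximum must lie once doubling stops.

```python
def minimize_at_least(current, is_positive):
    """Smallest value in [0, current] for which the graph stays positive."""
    lo, hi = 0, current                 # `current` is known to be positive
    while lo < hi:
        mid = (lo + hi) // 2
        if is_positive(mid):
            hi = mid                    # mid works, try something smaller
        else:
            lo = mid + 1                # mid is too general
    return lo

def maximize_at_most(current, is_positive):
    """Largest value for which the graph stays positive (constraint not removable)."""
    m = max(1, current)
    while is_positive(2 * m):           # doubling phase
        m *= 2
    lo, hi = current, 2 * m - 1         # 2m is negative, `current` is positive
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if is_positive(mid):
            lo = mid                    # mid works, try something larger
        else:
            hi = mid - 1
    return lo

# Toy oracles with hidden thresholds, counting the membership queries asked.
queries = []
def at_most_oracle(value, hidden_max=100):
    queries.append(value)
    return value <= hidden_max

best = maximize_at_most(3, at_most_oracle)
```

The number of oracle calls recorded in `queries` grows with the logarithm of the hidden threshold rather than with the threshold itself, which is the point of replacing the parameterization by m.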
Figure 7.6: A target schema requiring exponentially many membership queries.



Theorem 102 Classic is learnable from membership and equivalence queries in time polynomial in $|\mathcal{R}|$, $|\mathcal{I}|$, $|\mathcal{P}|$, $s$, and $t$, where $\mathcal{R}$ is the set of roles, $\mathcal{I}$ is the set of disjoint primitives, $\mathcal{P}$ is the set of primitives, $s$ is the number of symbols needed to write the target Classic description, and $t$ is the number of symbols needed to write the largest counterexample description witnessed.

Proof: The running time of the algorithm depends only polynomially on $\log m$ for the largest integer $m$ explicitly appearing in an AT-LEAST or AT-MOST constraint; this means that the algorithm depends only polynomially on the number of digits needed to represent the largest integer $m$ explicitly appearing in an AT-LEAST or AT-MOST constraint. But the number of digits needed to represent this $m$ is certainly no more than $s$, the number of symbols needed to write the target Classic description. Note also that the updated version of Prune obviates the "guessing" done in Theorem 101 to determine a suitable value of $m$; we have abandoned the parameterization by $m$ completely. □



7.6 The Insufficiency of Membership Queries 

We show in this section that learnability cannot be achieved solely through membership queries. 
This, coupled with the result of Cohen and Hirsh [33] shows that membership and equivalence 
queries form a minimal set of queries with which Classic can be exactly learned. 

To this end, we now show that it is impossible to learn the class of Classic sentences with
simple corresponding labeled equivalence graphs in polynomial time using membership queries 
alone. To do this we exhibit a target for which an adversarial teacher will be able to force the 
learner to ask exponentially many membership queries before producing a concept equivalent 
to the target concept. 
Theorem 103 Any algorithm that uses membership queries alone requires $\Omega(2^{|\Sigma|})$ membership queries to distinguish between Classic sentences whose equivalence graphs have the form shown in figure 7.6.

Proof: Figure 7.6 is a template for $O(2^{|\Sigma|})$ distinct concepts based on the partitioning of edge labels 12 of $\Sigma$ into (disjoint but exhaustive) sets $\Sigma_1$ and $\Sigma_2$. A straightforward adversary argument shows that at least $2^{|\Sigma|-1} - 1$ membership queries are still required:

First, any membership query supporting all strings with a single equivalence class is a positive example; no information is obtained by asking such a query. Second, any membership query that does not support some edge label from the root is a negative example because a target-supported string is unsupported; membership queries of this form provide no information. Third, any membership query that partitions the set of edge labels emanating from the root into more than two sets is a negative example because some pair of target-equivalent strings are inequivalent; such membership queries provide no information in distinguishing among the possible targets.

Thus, only membership queries that partition the set of length 1 strings into two supported equivalence classes can provide any information. Any query that does not partition the edge labels into exactly the same sets as the target is a negative example. There are $2^{|\Sigma|-1}$ such partitions. The adversary simply answers any such query "no" until all but one of the partitionings have been exhausted. Notice that if the learner outputs some conjecture before exhausting all possible partitions, the teacher simply asserts that the conjecture is incorrect by choosing as the target any unexplored partitioning. □
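The size of the candidate space in this adversary argument can be checked by brute force for small alphabets. The sketch below is only illustrative; the function name `two_block_partitions` is hypothetical, and it enumerates the splits of the edge labels into two non-empty, unordered blocks, of which there are $2^{|\Sigma|-1} - 1$ (one fewer than the count that also includes the trivial split with an empty block).

```python
from itertools import combinations

def two_block_partitions(labels):
    """All ways to split `labels` into two non-empty unordered blocks."""
    labels = sorted(labels)
    first, rest = labels[0], labels[1:]
    partitions = []
    # Pin `first` to block one so each unordered partition is produced once.
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            block_one = frozenset((first,) + chosen)
            block_two = frozenset(labels) - block_one
            if block_two:               # skip the trivial split (empty block)
                partitions.append((block_one, block_two))
    return partitions

candidates = two_block_partitions("abcd")
counts = {n: len(two_block_partitions(range(n))) for n in range(2, 8)}
```

Each entry of `counts` matches the closed form, so the number of candidate targets the adversary can keep alive doubles with every additional edge label.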

Corollary 104 The class of unlabeled equivalence graphs (labeled equivalence graphs) (Classic descriptions) cannot be learned in polynomial time from membership queries alone, even when the target is known to be acyclic and contain at most three vertices.

A similar result holds when there are only two roles (i.e., $|\Sigma| = 2$):

Corollary 105 The class of unlabeled equivalence graphs (labeled equivalence graphs) (Classic descriptions) cannot be learned in polynomial time from membership queries alone, even when $|\Sigma| = 2$ and the target is known to be acyclic.

12 To adhere to the semantics of Classic, consider a collection of roles all of which are attributes.
Proof Sketch: Simulate a set $\Sigma'$ of edge labels with $|\Sigma'| = n = 2^k$ by using the labels of $\Sigma$ as a binary code to label a depth $k - 1$ binary tree so that all length $k - 1$ strings are supported and every string of length $k - 1$ or less is in its own equivalence class. Now construct two more equivalence classes such that every string of length $k$ is in one of these equivalence classes. (See figure 7.7.) These last two equivalence classes simulate the non-root equivalence classes of the target in Theorem 103, so that learning this target requires $2^{n-1} - 1$ membership queries, but the target has only $\sum_{i=0}^{k} 2^i = O(n)$ vertices. □



Figure 7.7: This schema requires exponentially many membership queries even though $\Sigma$ is known to be the set $\{\sigma_1, \sigma_2\}$.



7.7 Membership Query Response Errors

Valiant [99] introduced the notion of PAC learning in the presence of malicious errors [38]. In this learning model, an adversary is allowed to maliciously perturb the probability distribution D on examples by substituting for an example chosen randomly according to D an arbitrary example with arbitrary label; the probability of this substitution occurring is called the malicious error rate [38]. A number of authors have investigated this and related models (for example, see [11, 16, 38, 64, 84, 90, 96, 93]). In this section we investigate the presence of malicious noise in the answers to membership queries and show that very high rates of such noise can be robustly tolerated.

Our model of persistent malicious classification errors is as follows. Let $r(\cdot)$ be any polynomial, and let $G$ be any graph. The first time a membership query is made on $G$ the teacher (adversary) flips a coin that with probability $\frac{1}{2} - \frac{1}{r(n_{G^*})}$ lands heads, where $G^*$ is the target concept and $n_{G^*}$ is its number of vertices. If the coin lands heads, the adversary is permitted to answer the query incorrectly if he chooses; however, if the coin lands tails the adversary must correctly answer the query. Thereafter, the answer to a membership query on $G$ will be the same answer as was first given.

The language of labeled equivalence graphs admits the random construction of a number of independent examples having the property that either all of them are positive examples or all of them are negative examples. Exploiting this property, we give a general test for verifying the answer to a membership query in the presence of even a significant amount of "persistent malicious noise" in the answers to membership queries.

For technical reasons, we consider only reduced labeled equivalence graphs: labeled equivalence graphs in which every vertex with maximal label and outdegree 0 has indegree at least 2. Prohibiting maximal-label vertices with indegree 1 and outdegree 0 does not change the ability of labeled equivalence graphs to represent Classic descriptions concisely; however, permitting such vertices does illuminate a difference between the semantics of labeled equivalence graphs and the semantics of Classic. A given labeled equivalence graph may be a negative example of the target due to the fact that some target string is unsupported but the vertex reached by this target-supported string imposes no other semantic constraint in terms of equivalence of strings or vertex label; in Classic such a graph would have arisen from some subexpression stating, "All individuals in the relation r to individuals in this set are individuals in the universe." Clearly such a statement is true for every role r and every individual in the universe; eliminating such a subexpression does not change the semantics of a Classic description, but the two labeled equivalence graphs representing these descriptions would be semantically different. We now claim the following result.

Lemma 106 Let $G^*$ be the target, let $r$ be any polynomial, let $\delta > 0$ be any probability, and let $G$ be any positive example of $G^*$. If $|\Sigma| > 1$, then there is a polynomial time algorithm using membership queries with persistent malicious classification error rate $\frac{1}{2} - \frac{1}{r(n_{G^*})}$ that determines with probability $1 - \delta$ whether deleting a given edge, $e$, from $G$ produces a positive example.

Proof: Let $n_{G^*}$ be the number of vertices in $G^*$. Let $k = 1 + n_{G^*} + 6r^2(n_{G^*})\ln(1/\delta)$. Choose
a random string $s$ from $\Sigma^k$ and construct a labeled path $p$ from $s$ using $k$ unlabeled vertices,
where the last symbol of $s$ is the label of a self-loop edge on the last vertex. Redirect the
terminus of $e$ to the beginning of this path. If $e$ was deletable, then this new graph is a positive
example.

If $e$ was not deletable, then either it must be used to capture some equivalence constraint
expressed in the target or it must be used in a path that reaches some vertex label constraint
expressed in the target.$^{13}$ In either case such a constraint must occur along a path of length
at most $n_{G^*}$ within the target, as that is the number of vertices in $G^*$. However, the first $n_{G^*}$
vertices of path $p$ express no constraint of any kind, so redirecting $e$ must cause some
constraint of $G^*$ to be violated. Thus if $e$ is not deletable, this new graph is a negative example.

Now generate $6r(n_{G^*})\,(r(n_{G^*}) - 2)\ln(1/\delta)$ such variations of $G$, being careful not to use the same
string $s$ twice. It is clear that every one of these variations is a positive example if $e$ is deletable and
every one of these variations is a negative example if $e$ is not deletable. Ask a membership query
on each of these variations. By Chernoff bounds, the probability that fewer than a $\frac{1}{2} + \frac{1}{2r(n_{G^*})}$ fraction
of these queries are answered correctly is less than $\delta$. Thus, with probability $1 - \delta$, $e$ can be
deleted if and only if the majority of these queries are answered "yes". □
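
The majority-vote step of this proof can be illustrated with a small simulation. The sketch below is not the thesis's algorithm: the oracle class, the function names, and the use of integers as stand-ins for the modified graphs are all hypothetical, and the query count simply follows the expression in the proof as we read it, 6 r (r - 2) ln(1/delta), which should be treated as an assumption. The point it shows is why distinct variations are needed: a persistent error on one query cannot be averaged away by repeating that query, but errors on distinct queries can be outvoted.

```python
import math
import random


def num_variations(r_val, delta):
    """Query count 6 * r * (r - 2) * ln(1/delta), as read from the proof (assumed)."""
    return max(1, math.ceil(6 * r_val * (r_val - 2) * math.log(1 / delta)))


class PersistentNoisyOracle:
    """Hypothetical membership oracle whose wrong answers never change on re-asking."""

    def __init__(self, truth, error_rate, seed=0):
        self._truth = truth            # function: query -> correct bool label
        self._error_rate = error_rate  # probability that a given query is corrupted
        self._rng = random.Random(seed)
        self._corrupted = {}           # query -> bool, decided once per query

    def ask(self, query):
        if query not in self._corrupted:
            self._corrupted[query] = self._rng.random() < self._error_rate
        answer = self._truth(query)
        # Worst case: a corrupted query is always answered incorrectly.
        return (not answer) if self._corrupted[query] else answer


def edge_is_deletable(oracle, variations):
    """Majority vote over membership queries on distinct variations of G."""
    yes = sum(1 for v in variations if oracle.ask(v))
    return 2 * yes > len(variations)


r_val, delta = 4, 0.05
m = num_variations(r_val, delta)
variations = list(range(m))  # stand-ins for m distinct modified graphs
oracle = PersistentNoisyOracle(lambda v: True, 0.5 - 1 / r_val, seed=1)
print(m, edge_is_deletable(oracle, variations))
```

Asking the same variation repeatedly would return the same (possibly wrong) answer every time, which is why the proof insists on a fresh string s for each variation.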

We hasten to add that the assumption that the edge label set $\Sigma$ have cardinality at least
two is minor; the class of labeled equivalence graphs over an edge label set of size 1 can be 
easily learned using a straightforward dovetailing algorithm that guesses whether there is a 
cycle in the graph and if so, how long before the cycle begins. Vertex labels are modified only 
in response to equivalence query counterexamples and are thus not affected by the errors in 
membership query responses. 

We now have the main result of this section. 

Theorem 107 Let $r$ be any polynomial and let $\delta_0 > 0$ be given. Then the class of reduced labeled
equivalence graphs is exactly learnable with probability $1 - \delta_0$, in time polynomial in $\ln(1/\delta_0)$, $d$
(the length of the longest chain in the lattice over which the vertex labels are defined), the size
of the target concept, and the size of the longest counterexample received, from membership and
equivalence queries, even with a persistent malicious classification error rate of $\frac{1}{2} - \frac{1}{r(n_{G^*})}$.

Proof Sketch: Membership queries are used only in the Prune procedure of figure 7.4.
Instead of relying on the answers, replace each such query with the procedure in the proof of
Lemma 106, with parameter $\delta = \delta_0/s$, where $s$ is the total number of membership queries that
the learning algorithm would make without noisy membership queries. Thus, the probability
that any of the invocations of this procedure is incorrect totals at most $\delta_0$. If the size of the
target concept and the size of the longest counterexample to be received are known a priori, then $s$ is given
by Theorem 96. If these values are not known, then the standard technique of dynamically
guessing $s$ can be employed while increasing the running time only polynomially. □

13 This makes use of the assumption that the equivalence graph is reduced.

7.8 Summary 

We have demonstrated a positive polynomial time learnability result using membership and 
equivalence queries for labeled equivalence graphs with vertex labels chosen from a finite lattice, 
and we adapted this algorithm to obtain a polynomial time algorithm for the "natural" first- 
order concept class Classic. We then showed that the learnability did not rest solely on the 
power of the membership queries by giving a non-learnability result for membership-query-only
algorithms. Finally we showed that, although the membership queries are necessary, the
accuracy in the answers to those queries need only be polynomially better than 1/2.

Another possible research direction centers around the joining of two graphs. When the join 
is computed, edge labels are deleted if the label does not appear in both graphs to be joined. 
What if the edge labels are chosen from a partially ordered set, where joining two edge labels 
might permit constructing an edge labeled with the least upper bound of the two edge labels to 
be joined? Because the edge labels capture functions in Classic, such generalizations appear 
to dwell in the world of second-order logics. For this reason we believe that a positive result in this
direction will rely heavily upon the structure of the underlying partially ordered set.

Conclusion



The goal of inquiry about automating the knowledge representation process is either to produce
a learning algorithm that efficiently automates the encoding of any representation that
uses some useful representation language L or to show that no such learning algorithm is
possible. The centerpiece of this thesis was that there do exist learning algorithms for two
natural representation languages: propositional Horn sentences and Classic. In addition, this
thesis introduced a new method - consistently ignorant teachers - of modeling uncertainty in
the information being collected. The thesis demonstrated that, by careful consideration of the
task at hand, the tools that have been developed in the field of computational learning theory
can be used to automate the process of constructing the explanations required by real-world
tasks in fields outside computational learning theory. This has been the foundation; it is hoped
that by way of example, the potential of adapting computational learning theory techniques
to more ambitious domains has been demonstrated along with the considerations required in
making such an adaptation. Adapting the tools of computational learning theory to other
representation languages for many other domains lies ahead.

Appendix A

Thesis Synopsis 



For expository purposes, we define a variety of characters that the example class might assume. 
To fully describe an example class we must specify its elements and the way that labels are 
assigned to elements of the example class by the concept class. Because the examples are 
intended to teach the learner the behavior of the target, we speak of the examples as being 
provided by a teacher. Thus, we associate a teacher with each kind of example class. 

Definition 108 (Standard Teacher) Let C be a concept class of propositional formulas, let 
X be the set of truth assignments over the variables used in C, and let C* be the target. A 
teacher who uses X and labels examples according to whether C* is satisfied is called a standard 
teacher. 

Definition 109 (Entailed Example Teacher) Let C be a concept class of propositional (respectively,
first-order) logical formulas, let X be a class of propositional (respectively, first-order)
logical formulas, and let C* be the target. A teacher who uses X and labels examples according
to entailment by C* (that is, whether every truth assignment satisfying C* also satisfies the
example) is called an entailed example teacher.
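
To make the contrast between Definitions 108 and 109 concrete, the following is a minimal propositional sketch. It assumes formulas are represented as Python functions from a truth assignment (a dict of variable names to booleans) to a boolean, and it decides entailment by brute-force enumeration of all assignments, which is exponential in the number of variables and meant only as an illustration. The function names and the example target are hypothetical.

```python
from itertools import product


def assignments(variables):
    """Yield every truth assignment over the given variable names."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))


def standard_label(target, assignment):
    """Standard teacher: an example is a truth assignment, labeled by satisfaction."""
    return target(assignment)


def entailed_label(target, example, variables):
    """Entailed example teacher: an example is a formula, labeled by entailment.

    The example is positive iff every assignment satisfying the target
    also satisfies the example.
    """
    return all(example(a) for a in assignments(variables) if target(a))


variables = ["a", "b", "c"]


def target(v):
    # Illustrative Horn sentence: (a -> b) and (b -> c).
    return ((not v["a"]) or v["b"]) and ((not v["b"]) or v["c"])


def a_implies_c(v):
    return (not v["a"]) or v["c"]


def c_implies_a(v):
    return (not v["c"]) or v["a"]


print(standard_label(target, {"a": True, "b": True, "c": True}))
print(entailed_label(target, a_implies_c, variables))
print(entailed_label(target, c_implies_a, variables))
```

Under the standard teacher the learner sees labeled assignments; under the entailed example teacher it sees labeled formulas, such as Horn clauses, that the target does or does not imply.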

The next definition provides for a new variety of teacher. The definition allows for teachers
who are rational but not omniscient; that is, teachers who have knowledge gaps for which 
knowledge about the concept class does not help. 

Definition 110 (Consistently Ignorant Teacher) Let C be a concept class of propositional
formulas and let X be the set of truth assignments over the variables used in C. A consistently
ignorant teacher is a standard teacher who is permitted not to know the label of some example
x provided that among all the concepts in C consistent with what the teacher does know, there
are concepts $C_1$ and $C_2$ which disagree on the label of x.

In practice, the amount of uncertainty held by a consistently ignorant teacher is measured 
by assuming that the teacher has some particular subset S of concepts from C in mind and 
declares an example to be positive if every concept in S calls the example positive, declares an 
example to be negative if every concept in S calls the example negative, and declares "I don't 
know" if some pair of concepts in S disagree about the labeling of the example. The learner is 
permitted time polynomial in the amount of uncertainty of the teacher. 
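
The three-way labeling rule just described can be sketched as follows. This is an illustrative reading of the paragraph above, assuming concepts are represented as Python predicates over examples; the function name and the two sample concepts are hypothetical.

```python
def ignorant_label(concepts, x):
    """Label x as a consistently ignorant teacher holding the subset S = concepts.

    Returns "positive" if every concept in S accepts x, "negative" if every
    concept in S rejects x, and "I don't know" if two concepts in S disagree.
    """
    if not concepts:
        raise ValueError("the teacher must have at least one concept in mind")
    labels = {bool(c(x)) for c in concepts}
    if labels == {True}:
        return "positive"
    if labels == {False}:
        return "negative"
    return "I don't know"


# Illustrative S: the teacher cannot tell whether the target is x1 or (x1 and x2).
S = [
    lambda v: v["x1"],
    lambda v: v["x1"] and v["x2"],
]

for example in ({"x1": True, "x2": True},
                {"x1": False, "x2": False},
                {"x1": True, "x2": False}):
    print(example, ignorant_label(S, example))
```

The teacher's uncertainty grows with the disagreement among the concepts in S, which is the quantity the learner's running time is allowed to depend on.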

This work presents several results concerning modeling the world with propositional or first- 
order concepts. This work also presents alternatives to the idea that an example is a setting 
of the variables that is labeled according to whether it satisfies the unknown propositional 
description of the world. Specifically the following results have been obtained. 

• Positive Results 

1. Propositional Horn sentences are exactly learnable using equivalence and member- 
ship queries given a standard teacher (Theorem 23). 

2. Propositional Horn sentences are exactly learnable using equivalence and member- 
ship queries given an entailed example teacher using Horn clauses as examples (The- 
orem 40). 

3. The first-order class of conjunctions of pairs of definite clauses with k-ary predicates
and unary functions is exactly learnable using equivalence queries given an
entailed example teacher using literals as examples (Theorem 76). Achieving this 
result produced a new characterization of the regular languages. 

4. Several classes of boolean functions are learnable using equivalence and membership 
queries given a consistently ignorant teacher (Corollary 63).

5. Sets of boxes in E^d with samplable intersection are PAC learnable using membership
queries and random examples given a consistently ignorant teacher using points as
examples (Theorem 66).

6. Embedded multi-symmetric functions are exactly predictable using membership and
equivalence queries from a standard teacher (Theorem 5).

7. CLASSIC is exactly learnable using membership and equivalence queries from an entailed
example teacher using CLASSIC descriptions as examples, even when the membership queries
are answered with malicious errors with probability bounded polynomially below
1/2 (Theorem 107).

• Negative Results

1. PAC learning propositional 2-quasi Horn sentences (conjunctions of clauses having 
at most two positive literals) using membership queries from a standard teacher 
is no easier than PAC learning arbitrary DNF formulas from a standard teacher 
(Corollary 27). 

2. Exactly learning propositional Horn sentences using equivalence queries from an 
entailed example teacher using Horn sentences as examples is no easier than exactly 
learning arbitrary DNF formulas using equivalence queries from a standard teacher 
(Theorem 45). 

3. The first-order class ft 2t * of conjunctions of pairs of definite clauses is not PAC 
learnable using random examples given an entailed example teacher using atoms as 
examples unless NP = RP (Theorem 77). 

4. Sets of propositional Horn sentences with samplable intersection are not PAC learn- 
able using membership queries and random examples given a consistently ignorant 
teacher unless arbitrary DNF are PAC learnable using random examples given a 
standard teacher, assuming the existence of one-way functions (Corollary 69). 

5. CLASSIC is not learnable using only membership queries from an entailed example 
teacher using CLASSIC descriptions as examples (Theorem 103). As a consequence 
of this result, random examples and membership queries form a minimal set of 
queries for PAC learning Classic from an entailed example teacher using Classic 
descriptions as examples.

Abstract

Computer automation of tasks is part of the natural progression of encoding information.
When the task becomes well understood and repetitive, placing the task under computer control
becomes a possibility. Computers were once programmed by rewiring rather than with the use
of a modern program, management of a limited memory was once handled by the application
programmer rather than by the operating system, and efficient use of the computer's hardware
was once obtained by assembly language programmers rather than through a compiler. In
other areas, accounting moved from ledger books to spreadsheets, automobile fuel intake left
the carburetor for computer-controlled fuel injection, and diagnosis and scheduling left the
expert for the expert system.

Current knowledge representation research has sought to provide schemes for encoding
knowledge about how a given system behaves, with the goal being accuracy and utility. Can an
accurate description be given with the representation language being used? Can the resulting
representation be manipulated easily to answer questions about the system being described? To
the extent that both questions can be answered affirmatively for some representation language
L, encoding information using L is well understood. Ideally, the goal of encoding knowledge is
not the task of encoding, but the product of the encoding task. If such encodings are required
for a variety of systems, then the question of automating the process of encoding arises.

This thesis considers this automation process to be a question of whether it is possible to
automatically learn the encoding based on the behavior of the system to be described. A variety
of representation languages L are considered, as are a variety of means for the learner to acquire
a variety of types of data about the system in question. The learning process is abstracted as
a learning problem in which the goal is to efficiently collect sufficient information to identify
some hidden concept C represented using the language L. The source of information about C
is its relationship to some class of examples X that is assumed to be reasonably available even
though C itself is not. In addition to conjecturing guesses as to the identity of C, the learner
is permitted to ask how C relates to individuals x ∈ X.

The goal of inquiry about this automation process is either to produce a learning algorithm 
that efficiently automates the encoding of any representation that uses some useful representation
language L or to show that no such learning algorithm is possible. The centerpiece of this
thesis is that there do exist learning algorithms for two natural representation languages: propo- 
sitional Horn sentences and the Classic description logic. In addition, this thesis introduces a 
new method - consistently ignorant teachers - of modeling uncertainty in the information being 
collected. The goal of this thesis is to demonstrate that, by careful consideration of the task at 
hand, the tools that have been developed in the field of computational learning theory can be 
used to automate the process of constructing the explanations required by real-world tasks in 
fields outside computational learning theory. 
